Chi-square tests - AddHealth dataset
For this assessment I chose new variables so that both the explanatory and the response variables are categorical (a C->C relation), which is the condition for the Chi-square test of independence.
Note: the Python program is at the end of this blog post.
For this assessment, I was interested in examining the association between "the general health of teens" [H1GH1 = GENERAL_HEALTH] and two explanatory variables: whether teens were taught at school "where to go for help with a health problem" [H1TS15 = TAUGHT_IN_SCHOOL], and "the racial background of teens" [H1GI8 = RACIAL_BACK].
Categorical response variable
[GENERAL_HEALTH] : In general, how is your health? Would you say… (excellent | very good | good | fair | poor)
Categorical explanatory variables
[TAUGHT_IN_SCHOOL] : Taught in school "where to go for help with a health problem" (2 levels) [1: yes, 0: no]
[RACIAL_BACK] : Which one category best describes your racial background? (more than 2 levels) [1: White, 2: Black or African American, 3: American Indian or Native American, 4: Asian or Pacific Islander, 5: Other]
1) Model of interpretation for the Chi-square test of independence (2 levels)
After running the Chi-square test of independence for the explanatory variable TAUGHT_IN_SCHOOL (categorical with 2 levels) and the response variable GENERAL_HEALTH (categorical), we get the following result:
The bar graph for the two groups gives this:
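We have a p-value = 0.634 > 0.05 here, so we cannot reject the null hypothesis H0. In other words, the data show no significant association between the 2 variables. This is not proof that they are independent, only a lack of evidence against independence.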
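The decision rule used above can be sketched in a few lines. This is only an illustration: the counts in `observed` are hypothetical and are NOT the AddHealth results, and the names `observed`, `ALPHA` and `reject_h0` are made up for this example.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts (NOT the AddHealth results):
# rows = 5 levels of general health, columns = taught in school (no, yes)
observed = np.array([
    [120, 310],
    [150, 400],
    [100, 260],
    [30, 70],
    [5, 12],
])

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom and the table of expected counts
chi2, p_value, dof, expected = chi2_contingency(observed)

ALPHA = 0.05
reject_h0 = p_value < ALPHA  # True only if the association is significant

print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

For a table with 5 rows and 2 columns the degrees of freedom are (5 - 1) x (2 - 1) = 4.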
2) Model of interpretation for post hoc tests (more than 2 levels)
We now change the explanatory variable to "RACIAL_BACK" (categorical, 5 levels). The Bonferroni adjustment gives the adjusted significance level p/c, where c is the number of pairwise comparisons among the 5 groups:
c = 5! / (2! x (5 - 2)!) = (5 x 4) / 2 = 10
p/c = 0.05 / 10 = 0.005
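The same arithmetic can be checked in Python. This is a minimal sketch using only the standard library; the variable names are chosen for this example and do not appear in the program at the end of the post.

```python
from math import comb

ALPHA = 0.05   # unadjusted significance level
n_groups = 5   # levels of RACIAL_BACK

# number of distinct pairs of groups: "5 choose 2"
n_comparisons = comb(n_groups, 2)

# Bonferroni-adjusted threshold for each pairwise test
adjusted_alpha = ALPHA / n_comparisons

print(n_comparisons)   # 10
print(adjusted_alpha)  # 0.005
```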
The overall Chi-square test gives this output:
This result alone cannot tell us which groups differ from each other, so we need to run post hoc tests for this case.
a) Compare group 1 to 2 :
p-value = 0.02 > p/c = 0.005. This means the 2 groups do not differ significantly.
b) Compare group 1 to 3 :
p-value = 0.98 > p/c = 0.005. This means the 2 groups do not differ significantly either.
c) Compare group 1 to 4 :
p-value = 0.54 > p/c = 0.005. This means the 2 groups do not differ significantly either.
d) Compare group 1 to 5 :
p-value = 0.93 > p/c = 0.005. This means the 2 groups do not differ significantly either.
We see here that none of the 4 comparisons with group 1 is significant after the Bonferroni adjustment: there is no case with p-value < 0.005. Only the comparisons against group 1 were run, so the other 6 pairs of groups are not tested here. The test stops at this point.
This conclusion is consistent with the bar graph below:
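The comparisons above all involve group 1. A loop over every pair of groups would also cover the remaining pairs. The following is a hypothetical sketch, not part of the original program: the function name `pairwise_chi2` and the small `demo` data frame are made up for illustration, and the sketch assumes the grouping column holds plain numeric codes.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency


def pairwise_chi2(df, response, group, alpha=0.05):
    """Run a chi-square test for every pair of levels found in `group`."""
    levels = sorted(df[group].dropna().unique())
    pairs = list(combinations(levels, 2))
    adjusted_alpha = alpha / len(pairs)  # Bonferroni adjustment
    rows = []
    for a, b in pairs:
        # keep only the rows that belong to the two groups being compared
        pair_df = df[df[group].isin([a, b])]
        table = pd.crosstab(pair_df[response], pair_df[group])
        chi2, p_value, dof, _ = chi2_contingency(table)
        rows.append({
            "pair": (a, b),
            "chi2": chi2,
            "p_value": p_value,
            "significant": p_value < adjusted_alpha,
        })
    return pd.DataFrame(rows), adjusted_alpha


# Tiny made-up data set, only to show the call (NOT AddHealth data)
demo = pd.DataFrame({
    "GENERAL_HEALTH": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3] * 5,
    "RACIAL_BACK":    [1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 2, 3] * 5,
})
results, adjusted_alpha = pairwise_chi2(demo, "GENERAL_HEALTH", "RACIAL_BACK")
print(results)
print("adjusted alpha:", adjusted_alpha)
```

With 5 groups the function would run 10 tests and compare each p-value with 0.05 / 10 = 0.005.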
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
data = pandas.read_csv('../../datasets & codebooks/addhealth_pds.csv', low_memory=False)
#setting variables you will be working with to numeric
# Explanatory variables : C
# 2 levels
# H1TS15 : Taught in School : where to go for help with a health problem
data['H1TS15'] = pandas.to_numeric(data['H1TS15'], errors='coerce')
# more than 2 levels
# H1GI8 : Which one category best describes your racial background?
data['H1GI8'] = pandas.to_numeric(data['H1GI8'], errors='coerce')
# Response variable : C
# H1GH1 : In general, how is your health? Would you say… (excellent | very good | good | fair | poor)
data['H1GH1'] = pandas.to_numeric(data['H1GH1'], errors='coerce')
#SETTING MISSING DATA & RENAME VARIABLE
data['H1TS15']=data['H1TS15'].replace(6, numpy.nan)
data['TAUGHT_IN_SCHOOL']=data['H1TS15'].replace(8, numpy.nan)
data['H1GI8']=data['H1GI8'].replace(6, numpy.nan)
data['H1GI8']=data['H1GI8'].replace(7, numpy.nan)
data['H1GI8']=data['H1GI8'].replace(8, numpy.nan)
data['RACIAL_BACK']=data['H1GI8'].replace(9, numpy.nan)
data['H1GH1']=data['H1GH1'].replace(6, numpy.nan)
data['GENERAL_HEALTH']=data['H1GH1'].replace(8, numpy.nan)
#Subset 1 : my needed variables
sub1 = data[['TAUGHT_IN_SCHOOL', 'RACIAL_BACK', 'GENERAL_HEALTH']].dropna()
print ('############# simple X2 t-o-i (2 levels) #############')
# contingency table of observed counts
ct0=pandas.crosstab(sub1['GENERAL_HEALTH'], sub1['TAUGHT_IN_SCHOOL'])
print (ct0)
# column percentages
colsum0=ct0.sum(axis=0)
colpct0=ct0/colsum0
print(colpct0)
# chi-square
print ('chi-square value, p value, expected counts')
cs0= scipy.stats.chi2_contingency(ct0)
print (cs0)
sub1["TAUGHT_IN_SCHOOL"] = sub1["TAUGHT_IN_SCHOOL"].astype('category')
#0 : no
#1 : yes
# bar chart of the mean GENERAL_HEALTH code for each group
# (seaborn.catplot replaces the older seaborn.factorplot)
seaborn.catplot(x="TAUGHT_IN_SCHOOL", y="GENERAL_HEALTH", data=sub1, kind="bar", ci=None)
plt.xlabel('Taught in school')
plt.ylabel('General health')
print ('############# post hoc (more than 2 levels) #############')
# contingency table of observed counts
ct1=pandas.crosstab(sub1['GENERAL_HEALTH'], sub1['RACIAL_BACK'])
print (ct1)
# column percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)
# chi-square
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)
sub1["RACIAL_BACK"] = sub1["RACIAL_BACK"].astype('category')
#1:White
#2:Black or African American
#3:American Indian or Native American
#4:Asian or Pacific Islander
#5:Other
# bar chart of the mean GENERAL_HEALTH code for each group
# (seaborn.catplot replaces the older seaborn.factorplot)
seaborn.catplot(x="RACIAL_BACK", y="GENERAL_HEALTH", data=sub1, kind="bar", ci=None)
plt.xlabel('Racial background')
plt.ylabel('General health')
# post hoc comparison: group 1 vs group 2 (all other groups become NaN)
recode2 = {1: 1, 2: 2}
sub1['COMP1v2']= sub1['RACIAL_BACK'].map(recode2)
# contingency table of observed counts
ct2=pandas.crosstab(sub1['GENERAL_HEALTH'], sub1['COMP1v2'])
print (ct2)
# column percentages
colsum=ct2.sum(axis=0)
colpct=ct2/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)
# post hoc comparison: group 1 vs group 3
recode3 = {1: 1, 3: 3}
sub1['COMP1v3']= sub1['RACIAL_BACK'].map(recode3)
# contingency table of observed counts
ct3=pandas.crosstab(sub1['GENERAL_HEALTH'], sub1['COMP1v3'])
print (ct3)
# column percentages
colsum=ct3.sum(axis=0)
colpct=ct3/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs3= scipy.stats.chi2_contingency(ct3)
print (cs3)
# post hoc comparison: group 1 vs group 4
recode4 = {1: 1, 4: 4}
sub1['COMP1v4']= sub1['RACIAL_BACK'].map(recode4)
# contingency table of observed counts
ct4=pandas.crosstab(sub1['GENERAL_HEALTH'], sub1['COMP1v4'])
print (ct4)
# column percentages
colsum=ct4.sum(axis=0)
colpct=ct4/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs4= scipy.stats.chi2_contingency(ct4)
print (cs4)
# post hoc comparison: group 1 vs group 5
recode5 = {1: 1, 5: 5}
sub1['COMP1v5']= sub1['RACIAL_BACK'].map(recode5)
# contingency table of observed counts
ct5=pandas.crosstab(sub1['GENERAL_HEALTH'], sub1['COMP1v5'])
print (ct5)
# column percentages
colsum=ct5.sum(axis=0)
colpct=ct5/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs5= scipy.stats.chi2_contingency(ct5)
print (cs5)