The null hypothesis (H0) is that there is no difference in beer-drinking behaviour between the US census regions.
Categorical Variables:
1. 337-337 S2AQ5A DRANK ANY BEER IN LAST 12 MONTHS
   18346  1. Yes
    8562  2. No
      38  9. Unknown
   16147  BL. NA, former drinker or lifetime abstainer
2. 41-41 REGION CENSUS REGION
    8209  1. Northeast
    8991  2. Midwest
   16156  3. South
    9737  4. West
Program:
A cross table of the 2 categorical variables 'S2AQ5A' and 'REGION' was generated with the pandas.crosstab function.
The chi-square statistic and p-value were calculated with the scipy.stats.chi2_contingency function, once for all regions together and in 3 post-hoc tests, each a 2 by 2 comparison of two regions.
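As a rough illustration of what the test computes, the sketch below rebuilds the chi-square statistic by hand from the observed counts in the cross table shown in the Output section. Each expected count is the row total times the column total divided by the grand total. The sketch uses plain Python only; the function name chi_square_statistic is an illustrative choice and not part of the original program.

```python
# Observed counts copied from the cross table in the Output section
# (rows: 1 = Yes, 2 = No; columns: Midwest, Northeast, South, West).
observed = [
    [4271, 3535, 6043, 4497],
    [1811, 1953, 2951, 1847],
]


def chi_square_statistic(table):
    """Return (chi-square, degrees of freedom, expected counts) for an r x c table.

    Illustrative helper; no continuity correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count under H0: row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected


chi2, dof, expected = chi_square_statistic(observed)
print(round(chi2, 2), dof)
```

The printed statistic and degrees of freedom should agree with the first two values that scipy.stats.chi2_contingency reports for all regions in the Output section.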
Result:
H0 has to be rejected (p << 0.05) for the comparison of all regions and for 2 of the 3 post-hoc tests.
Only for the post-hoc comparison between West and Midwest can H0 not be rejected (p > 0.05).
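Because several post-hoc tests are run on the same data, the chance of a false positive grows with each comparison. One common safeguard, which the original program does not apply, is a Bonferroni adjustment: the significance level is divided by the number of comparisons. The sketch below assumes this adjustment and uses the three p-values reported in the Output section; the variable names are illustrative.

```python
alpha = 0.05

# p-values copied from the post-hoc results in the Output section
p_values = {
    "Northeast - Midwest": 3.106283333437607e-11,
    "West - Midwest": 0.4295133634241102,
    "West - South": 1.2644599462751704e-06,
}

# Bonferroni: divide alpha by the number of comparisons actually performed (3 here)
adjusted_alpha = alpha / len(p_values)

reject_h0 = {pair: p < adjusted_alpha for pair, p in p_values.items()}
for pair, rejected in reject_h0.items():
    print(pair, "reject H0" if rejected else "do not reject H0")
```

With a threshold of 0.05 / 3 (about 0.017) the conclusions stay the same: H0 is rejected for Northeast - Midwest and West - South, but not for West - Midwest.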
-------------------------------------------------------------------------------------------------------
Program:
import os
import pandas
import numpy
import scipy.stats
# load the dataset (adjust the file name to your own copy)
data = pandas.read_csv('nesarc_pds.csv', low_memory=False)
# 2x categorical Variables used in program
# 337-337 S2AQ5A DRANK ANY BEER IN LAST 12 MONTHS
# 18346 1. Yes 8562 2. No 38 9. Unknown 16147 BL. NA, former drinker or lifetime abstainer
# 41-41 REGION CENSUS REGION
# 8209 1. Northeast 8991 2. Midwest 16156 3. South 9737 4. West
# recode missing values to python missing (NaN)
data['S2AQ5A']=data['S2AQ5A'].replace('9', numpy.nan) # defined as char
data['S2AQ5A']=data['S2AQ5A'].replace(' ', numpy.nan) # needed before set to numeric
# convert the variable to numeric
data['S2AQ5A'] = pandas.to_numeric(data['S2AQ5A'], errors='coerce')
# data subset only for needed columns
sub1 = data[['REGION','S2AQ5A']]
# make a copy of the subset and remove rows with NaN
sub2 = sub1.copy()
sub2 = sub2.dropna()
recode2 = {1: "Northeast", 2: "Midwest", 3: "South", 4: "West"}
sub2['REGION']= sub2['REGION'].map(recode2)
print(len(sub2)) # No. of rows
print(len(sub2.columns))
print(sub2.head(5))
# contingency (frequency) table of observed counts for both variables
ct1=pandas.crosstab(sub2['S2AQ5A'], sub2['REGION'])
print('----- Cross table -------')
print (ct1)
# column percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print('----- in % -------')
print(colpct)
# chi-square
print ()
print ('chi-square value, p value, df, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)
print ()
print ('....... Post Hoc Compare 1 NE - MW .........')
recode3 = {"Northeast": "Northeast", "Midwest": "Midwest" }
sub2['COMP1']= sub2['REGION'].map(recode3)
# contingency table of observed counts
ct2=pandas.crosstab(sub2['S2AQ5A'], sub2['COMP1'])
print (ct2)
# column percentages
colsum=ct2.sum(axis=0)
colpct=ct2/colsum
print(colpct)
print ('chi-square value, p value, df, expected counts')
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)
print ()
print ('....... Post Hoc Compare 2 W - MW .........')
recode3 = {"West": "West", "Midwest": "Midwest" }
sub2['COMP2']= sub2['REGION'].map(recode3)
# contingency table of observed counts
ct2=pandas.crosstab(sub2['S2AQ5A'], sub2['COMP2'])
print (ct2)
# column percentages
colsum=ct2.sum(axis=0)
colpct=ct2/colsum
print(colpct)
print ('chi-square value, p value, df, expected counts')
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)
print ()
print ('....... Post Hoc Compare 3 W - S .........')
recode3 = {"West": "West", "South": "South" }
sub2['COMP3']= sub2['REGION'].map(recode3)
# contingency table of observed counts
ct2=pandas.crosstab(sub2['S2AQ5A'], sub2['COMP3'])
print (ct2)
# column percentages
colsum=ct2.sum(axis=0)
colpct=ct2/colsum
print(colpct)
print ('chi-square value, p value, df, expected counts')
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)
-------------------------------------------------------------------------------
Output:
------------------------------------------------------------------------------
----- Cross table -------
REGION Midwest Northeast South West
S2AQ5A
1.0 4271 3535 6043 4497
2.0 1811 1953 2951 1847
----- in % -------
REGION Midwest Northeast South West
S2AQ5A
1.0 0.702236 0.644133 0.671892 0.708859
2.0 0.297764 0.355867 0.328108 0.291141
chi-square value, p value, df, expected counts
(73.07946685879071, 9.346703898504186e-16, 3, array([[4146.7359893 , 3741.74401665, 6132.1511818 , 4325.36881225],
[1935.2640107 , 1746.25598335, 2861.8488182 , 2018.63118775]]))
--> H0 rejected: p = 9.3e-16; the share of beer drinkers differs significantly between regions and is higher in Midwest & West than in Northeast & South
....... Post Hoc Compare 1 Northeast - Midwest .........................................
COMP1 Midwest Northeast
S2AQ5A
1.0 4271 3535
2.0 1811 1953
COMP1 Midwest Northeast
S2AQ5A
1.0 0.702236 0.644133
2.0 0.297764 0.355867
chi-square value, p value, df, expected counts
(44.10875595370776, 3.106283333437607e-11, 1, array([[4103.37873812, 3702.62126188],
[1978.62126188, 1785.37873812]]))
--> H0 rejected: p = 3.1e-11; significantly more beer drinkers in Midwest than in Northeast
....... Post Hoc Compare 2 West - Midwest ........................................
COMP2 Midwest West
S2AQ5A
1.0 4271 4497
2.0 1811 1847
COMP2 Midwest West
S2AQ5A
1.0 0.702236 0.708859
2.0 0.297764 0.291141
chi-square value, p value, df, expected counts
(0.6241389726531966, 0.4295133634241102, 1, array([[4291.56413971, 4476.43586029],
[1790.43586029, 1867.56413971]]))
--> H0 not rejected: p = 0.43; no significant difference in drinking behaviour between West and Midwest
....... Post Hoc Compare 3 West - South ..........................................
COMP3 South West
S2AQ5A
1.0 6043 4497
2.0 2951 1847
COMP3 South West
S2AQ5A
1.0 0.671892 0.708859
2.0 0.328108 0.291141
chi-square value, p value, df, expected counts
(23.476544240410604, 1.2644599462751704e-06, 1, array([[6180.51636458, 4359.48363542],
[2813.48363542, 1984.51636458]]))
--> H0 rejected: p = 1.3e-06; significantly more beer drinkers in West than in South