
Aug 10, 2020

The effect of average income on the alcohol consumption rate per person

A study was conducted using the data set gathered by the Gapminder organization, which covers more than 150 countries and draws on data from the UN, to examine the effect of average income on the alcohol consumption rate.

For the analysis, both variables are categorical. The input is average income, coded 0 for below $2000 USD and 1 for above. The output is the alcohol consumption rate, coded 0 for below 3 liters, 1 for 3 to 9 liters, and 2 for above 9 liters.
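
The coding described above can be sketched as follows. This is only an illustration: the column names `income_usd` and `alcohol_litres` and all values are made up, and the handling of values that fall exactly on a boundary (2000 USD, 3 or 9 liters) is an assumption, since the post does not specify it.

```python
import pandas as pd

# Hypothetical values for illustration only; not taken from the Gapminder file.
df = pd.DataFrame({
    "income_usd": [500.0, 1500.0, 2500.0, 12000.0, 30000.0],
    "alcohol_litres": [1.2, 9.5, 3.0, 8.9, 12.4],
})

# Income: 0 = below 2000 USD, 1 = 2000 USD or above (boundary choice assumed).
df["Income"] = (df["income_usd"] >= 2000).astype(int)

# Alcohol use: 0 = below 3 liters, 1 = 3 up to 9 liters, 2 = 9 liters or more.
# right=False makes each bin include its left edge (boundary choice assumed).
df["Alcoholuse"] = pd.cut(
    df["alcohol_litres"],
    bins=[-float("inf"), 3, 9, float("inf")],
    labels=[0, 1, 2],
    right=False,
).astype(int)

print(df)
```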

So far, the graph suggests that countries with an average income below $2000 have a higher alcohol consumption rate.

To test this, a chi-squared test needs to be performed on the categorical input and output data.

The results of the chi-square test were as follows.

The p-value was 1.24*10^-9, which indicates a significant association, so we reject the null hypothesis that there is no relationship between average income and alcohol consumption.
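
As a rough sketch of how such a test is read, the example below runs `scipy.stats.chi2_contingency` on a small made-up table. The counts are hypothetical and do not reproduce the result reported above, and the 0.05 significance level is an assumption.

```python
import pandas as pd
import scipy.stats

# Hypothetical coded data for illustration; not the Gapminder results.
toy = pd.DataFrame({
    "Income":     [0] * 30 + [1] * 30,
    "Alcoholuse": [0] * 20 + [1] * 8 + [2] * 2 + [0] * 5 + [1] * 10 + [2] * 15,
})

# Rows are alcohol categories, columns are income groups.
ct = pd.crosstab(toy["Alcoholuse"], toy["Income"])

# chi2_contingency returns the statistic, p-value, degrees of freedom
# and the table of expected counts under independence.
chi2, p_value, dof, expected = scipy.stats.chi2_contingency(ct)
print(f"chi2={chi2:.2f}, p={p_value:.4g}, dof={dof}")

# Assumed significance level of 0.05.
if p_value < 0.05:
    print("Reject the null hypothesis of independence")
else:
    print("Fail to reject the null hypothesis of independence")
```

A 3-by-2 table has (3 - 1) * (2 - 1) = 2 degrees of freedom, which is what `dof` reports here.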

Next, we need to identify which categories of alcohol use differ significantly by average income, so we use a post hoc test with a Bonferroni correction. Each pairwise p-value is compared with 0.05 / 3, approximately 0.017, which is the significance level divided by the number of pairwise comparisons between the 3 categories.
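
The adjusted threshold is simple arithmetic. The sketch below assumes the conventional 0.05 significance level, which the post implies but does not state, and uses hypothetical variable names.

```python
from itertools import combinations

alpha = 0.05            # assumed overall significance level
categories = [0, 1, 2]  # the three alcohol-use codes

# Every unordered pair of categories is one post hoc comparison.
pairs = list(combinations(categories, 2))

# Bonferroni correction: divide alpha by the number of comparisons.
adjusted_alpha = alpha / len(pairs)

print(pairs)
print(round(adjusted_alpha, 4))
```

With three categories there happen to be exactly three pairs, so dividing by the number of categories and by the number of comparisons gives the same threshold here. With four categories there would be six comparisons.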

We therefore run a chi-square comparison for each pair of the 3 alcohol levels.
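
One way to avoid repeating the same block for each pair is a small loop. The helper `pairwise_chi_square` below is hypothetical, not part of pandas or SciPy, and it filters rows with `isin` instead of recoding with `map` as the source code further down does. The data are made up.

```python
from itertools import combinations

import pandas as pd
import scipy.stats


def pairwise_chi_square(df, group_col, outcome_col, alpha=0.05):
    """Chi-square test for every pair of outcome levels, Bonferroni-adjusted."""
    levels = sorted(df[outcome_col].dropna().unique())
    pairs = list(combinations(levels, 2))
    threshold = alpha / len(pairs)
    rows = []
    for a, b in pairs:
        # Keep only the rows that belong to the two levels being compared.
        subset = df[df[outcome_col].isin([a, b])]
        ct = pd.crosstab(subset[group_col], subset[outcome_col])
        chi2, p, dof, _ = scipy.stats.chi2_contingency(ct)
        rows.append({
            "pair": (int(a), int(b)),
            "chi2": float(chi2),
            "p": float(p),
            "significant": bool(p < threshold),
        })
    return pd.DataFrame(rows)


# Hypothetical coded data for illustration; not the Gapminder results.
toy = pd.DataFrame({
    "Income":     [0] * 30 + [1] * 30,
    "Alcoholuse": [0] * 20 + [1] * 8 + [2] * 2 + [0] * 5 + [1] * 10 + [2] * 15,
})

result = pairwise_chi_square(toy, "Income", "Alcoholuse")
print(result)
```

Note that for 2-by-2 tables `chi2_contingency` applies Yates' continuity correction by default, which is also what the source code below relies on.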

The results are as shown.

There was a significant difference between the first and third categories and between the second and third categories: in both cases the p-value was far smaller than the Bonferroni-adjusted threshold. We therefore reject the null hypothesis of no difference for those two pairs. This indicates that the alcohol levels can be divided into two categories: below 9 liters and more than 9 liters of alcohol consumption.

In conclusion, countries with an average income below 2000 USD are more likely to have an alcohol consumption rate of more than 9 liters per person.

Source code in Python

""" Created on Wed Aug 26 18:17:30 2015

@author: Omar Elfarouk """

import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder.csv', low_memory=False)

# new code: set the variables you will be working with to numeric
data['Alcoholuse'] = pandas.to_numeric(data['Alcoholuse'], errors='coerce')
data['Income'] = pandas.to_numeric(data['Income'], errors='coerce')
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')

# subset the data from the Excel sheet and do the data wrangling
sub1 = data
sub1 = sub1.replace(r'\s+', 0, regex=True)  # replace empty strings with zero
#sub1=data[(data['AGE']>=18) & (data['AGE']<=25) & (data['CHECK321']==1)]

# make a copy of my new subsetted data
sub2 = sub1.copy()

# recode missing values to python missing (NaN)
#sub2['Alcoholuse']=sub2['Alcoholuse'].replace(0, numpy.nan)
#sub2['Income']=sub2['Income'].replace(0, numpy.nan)
#sub2['suicideper100th']=sub2['suicideper100th'].replace(0, numpy.nan)
# recoding values for S3AQ3B1 into a new variable, USFREQMO
#recode1 = {1: 30, 2: 22, 3: 14, 4: 6, 5: 2.5, 6: 1}
#sub2['USFREQMO']= sub2['S3AQ3B1'].map(recode1)

# contingency table of observed counts
ct1 = pandas.crosstab(sub2['Alcoholuse'], sub2['Income'])
print(ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

# set variable types
sub2["Income"] = sub2["Income"].astype('category')
# new code for setting variables to numeric:
sub2['Alcoholuse'] = pandas.to_numeric(sub2['Alcoholuse'], errors='coerce')

# old code for setting variables to numeric:
#sub2['TAB12MDX'] = sub2['TAB12MDX'].convert_objects(convert_numeric=True)

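# graph the mean alcohol consumption category within each income group
# factorplot was renamed to catplot in newer seaborn releases;
# errorbar=None needs seaborn 0.12 or later (older versions used ci=None)
seaborn.catplot(x="Income", y="Alcoholuse", data=sub2, kind="bar", errorbar=None)
plt.xlabel('Income below or greater than 2000USD')
plt.ylabel('Alcohol consumption rate')
plt.show()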

# post hoc comparison: alcohol categories 0 and 1 (other values become NaN)
recode2 = {0: 0, 1: 1}
sub2['COMP1v1'] = sub2['Alcoholuse'].map(recode2)

# contingency table of observed counts
ct2 = pandas.crosstab(sub2['Income'], sub2['COMP1v1'])
print(ct2)

# column percentages for creating the column sum
colsum = ct2.sum(axis=0)
colpct = ct2 / colsum
print(colpct)

print('chi-square value, p value, expected counts')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)

# post hoc comparison: alcohol categories 1 and 2 (other values become NaN)
recode3 = {1: 1, 2: 2}
sub2['COMP1v2'] = sub2['Alcoholuse'].map(recode3)

# contingency table of observed counts
ct3 = pandas.crosstab(sub2['Income'], sub2['COMP1v2'])
print(ct3)

# column percentages for creating the column sum
colsum = ct3.sum(axis=0)
colpct = ct3 / colsum
print(colpct)

print('chi-square value, p value, expected counts')
cs3 = scipy.stats.chi2_contingency(ct3)
print(cs3)

# post hoc comparison: alcohol categories 0 and 2 (other values become NaN)
recode4 = {0: 0, 2: 2}
sub2['COMP1v14'] = sub2['Alcoholuse'].map(recode4)

# contingency table of observed counts
ct4 = pandas.crosstab(sub2['Income'], sub2['COMP1v14'])
print(ct4)

# column percentages for creating the column sum
colsum = ct4.sum(axis=0)
colpct = ct4 / colsum
print(colpct)

print('chi-square value, p value, expected counts')
cs4 = scipy.stats.chi2_contingency(ct4)
print(cs4)

#statistics data_analysis chi_square python

omarelfarouk90