Data analysis and interpretation-Week 2 - Writing your first program
I have chosen Python to start with my analysis.
Initially, I seemed to run this known issue with Spyder: https://github.com/spyder-ide/spyder/issues/2984. As I would have problems trying to render any image, I have decided to use iPhython notebooks instead. iPython notebooks are very convenient, because they allow me to add text between code snippets and figures so that I can better go back to them in the future.
The first thing I noticed when running Python for loading the dataset, is that the datafile seems to contain some errors, that is why I need to ignore them via forcing “error_bad_lines=False”:
data = pandas.read_csv('gapminder.csv', low_memory=False, error_bad_lines=False )
After trying to force the delimiter as “comma” with no luck, and, as this was a non-starter situation, I have decided to ignore these errors via forcing with the parameter “error_bad_lines=False”, in the read_csv function.
This produces the following warning for 9 countries, an consideration I will need to remember:
'Skipping line 43: expected 16 fields, saw 17\n Skipping line 44: expected 16 fields, saw 17\n Skipping line 85: expected 16 fields, saw 17\n Skipping line 101: expected 16 fields, saw 17\n Skipping line 102: expected 16 fields, saw 17\n Skipping line 114: expected 16 fields, saw 17\n Skipping line 115: expected 16 fields, saw 17\n Skipping line 127: expected 16 fields, saw 17\n Skipping line 212: expected 16 fields, saw 17\n'
This countries correspond to the following missing countries for the analysis:
Congo Rep., Costa Rica, Iceland, Kuwait, Kyrgyzstan, Madagascar, Malawi, Monaco and Zimbabwe.
After printing some general data about our dataset, these are the main stats:
- 202 number of observations (countries)
- 16 number of columns, whose headers are:
Index(['country', 'incomeperperson', 'alcconsumption', 'armedforcesrate', 'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate', 'internetuserate', 'lifeexpectancy', 'oilperperson', 'polityscore', 'relectricperperson', 'suicideper100th', 'employrate', 'urbanrate'], dtype='object') 202
STARTING THE UNIVARIATE ANALYSIS
Next, we start with the UNIVARIATE analysis, that is, examining each variable separately.
First, we take a look at the distribution of those variables EMPLOYRATE, URBANRATE AND INTERNETUSERATE, so: which values they take and how often they take those values:
These frequency tables speak about the three observed variables: they are continuous variables that take values from 0 to 100 in a contuinuum of values. For some cases, there are some repeated values across countries, but normally they take singular values. There are some NaN or missing values that count for different variables.
In the case of EMPLOYRATE, there are 33 missing values. In the case of URBANRATE there are 9 missing values. In the case of INTERNETUSERATE there are 19 missing values. We will need to account for this during our analysis. If you pass a variable with many unique values to table(), such a numeric variable, it will still produce a table of counts for each unique value, but the counts may not be particularly meaningful:
c1 = data["EMPLOYRATE"].value_counts(sort=False) print(c1)
employment rate nan 33 50.500000 1 61.500000 3 64.500000 1 63.500000 1 56.500000 1 53.500000 3 81.500000 1 60.500000 1 42.500000 1 54.500000 2 49.500000 1 71.300003 1 52.500000 1 58.500000 1 57.500000 2 73.199997 1 63.799999 2 60.700001 1 51.400002 1 41.099998 1 75.199997 1 56.400002 1 70.400002 2 51.299999 1 50.700001 1 78.199997 2 46.200001 1 42.000000 1 71.800003 1 43.099998 1 46.000000 2 32.000000 1 51.000000 2 65.699997 1 56.900002 1 56.000000 2 46.799999 1 59.000000 1 61.000000 2 62.299999 1 65.000000 3 47.099998 1 64.300003 1 56.299999 2 71.000000 1 72.000000 1 66.900002 1 40.099998 1 76.000000 1 77.000000 1 68.300003 1 78.900002 1 53.099998 1 83.000000 1 73.099998 1 55.900002 3 54.599998 1 66.000000 2 48.700001 2 56.799999 2 68.099998 1 38.900002 1 74.699997 1 47.799999 1 68.000000 1 62.700001 1 46.900002 1 44.700001 1 42.799999 1 59.799999 1 41.200001 1 57.599998 1 46.400002 1 55.700001 1 71.599998 1 45.700001 1 49.599998 1 58.400002 2 66.800003 1 80.699997 1 60.900002 1 63.900002 1 58.200001 2 55.599998 1 65.599998 1 57.900002 1 62.400002 1 59.099998 2 57.200001 1 53.400002 2 50.900002 2 57.099998 1 59.299999 1 61.799999 1 72.800003 1 66.599998 1 63.200001 1 51.200001 2 55.099998 1 42.400002 2 57.299999 1 63.099998 1 55.400002 1 61.700001 1 44.200001 1 73.599998 1 83.199997 2 64.900002 1 81.300003 1 52.099998 1 47.299999 3 59.700001 1 59.900002 3 54.400002 1 58.799999 2 75.699997 1 65.900002 1 44.799999 1 58.900002 2 67.300003 1 71.699997 1 52.700001 1 63.700001 1 65.099998 1 44.299999 1 48.599998 2 60.400002 2 79.800003 1 68.900002 1 41.599998 1 61.299999 1 37.400002 1 dtype: int64
p1 = data["EMPLOYRATE"].value_counts(sort=False, normalize=True) print (p1)
percentage employment rate nan 0.163366 50.500000 0.004950 61.500000 0.014851 64.500000 0.004950 63.500000 0.004950 56.500000 0.004950 53.500000 0.014851 81.500000 0.004950 60.500000 0.004950 42.500000 0.004950 54.500000 0.009901 49.500000 0.004950 71.300003 0.004950 52.500000 0.004950 58.500000 0.004950 57.500000 0.009901 73.199997 0.004950 63.799999 0.009901 60.700001 0.004950 51.400002 0.004950 41.099998 0.004950 75.199997 0.004950 56.400002 0.004950 70.400002 0.009901 51.299999 0.004950 50.700001 0.004950 78.199997 0.009901 46.200001 0.004950 42.000000 0.004950 71.800003 0.004950 43.099998 0.004950 46.000000 0.009901 32.000000 0.004950 51.000000 0.009901 65.699997 0.004950 56.900002 0.004950 56.000000 0.009901 46.799999 0.004950 59.000000 0.004950 61.000000 0.009901 62.299999 0.004950 65.000000 0.014851 47.099998 0.004950 64.300003 0.004950 56.299999 0.009901 71.000000 0.004950 72.000000 0.004950 66.900002 0.004950 40.099998 0.004950 76.000000 0.004950 77.000000 0.004950 68.300003 0.004950 78.900002 0.004950 53.099998 0.004950 83.000000 0.004950 73.099998 0.004950 55.900002 0.014851 54.599998 0.004950
urbanization rate nan 9 74.500000 1 73.500000 1 67.500000 1 26.460000 1 66.500000 1 87.300000 1 52.040000 1 71.100000 1 85.580000 1 73.480000 1 41.000000 1 17.000000 1 60.180000 1 92.680000 1 51.640000 1 32.580000 1 23.000000 1 52.740000 1 61.000000 1 86.960000 1 12.980000 1 73.920000 1 29.540000 1 82.440000 1 33.320000 1 15.100000 1 64.920000 1 57.180000 1 39.380000 1 14.320000 1 93.160000 1 42.000000 1 73.200000 1 77.540000 1 77.480000 1 48.360000 1 18.800000 1 54.340000 1 37.760000 1 69.460000 1 29.520000 1 70.360000 1 81.820000 1 43.840000 1 48.780000 1 56.420000 1 59.620000 1 95.640000 1 69.900000 1 88.920000 1 92.000000 1 77.360000 1 37.860000 1 68.460000 1 35.420000 1 12.540000 1 85.040000 1 24.760000 1 47.440000 1 38.580000 1 24.940000 1 21.600000 1 39.840000 1 71.620000 1 93.320000 1 29.840000 1 34.440000 1 47.880000 1 46.720000 1 50.020000 1 56.700000 1 34.480000 1 27.300000 1 60.560000 1 26.680000 1 27.840000 2 30.880000 1 16.540000 1 77.120000 1 51.460000 1 25.460000 1 77.200000 1 74.920000 1 19.560000 1 53.300000 1 10.400000 1 36.820000 1 24.780000 1 68.080000 1 36.280000 1 73.460000 1 77.880000 1 46.840000 1 73.640000 1 65.220000 1 84.540000 1 41.760000 1 74.820000 1 72.840000 1 42.380000 1 18.340000 1 86.560000 1 69.020000 1 36.160000 1 100.000000 4 80.460000 1 36.520000 1 92.300000 1 61.340000 1 88.520000 1 42.720000 1 51.920000 1 25.520000 1 65.580000 2 81.700000 1 21.560000 1 88.740000 1 48.600000 1 98.320000 1 71.900000 1 17.240000 1 13.220000 1 68.680000 1 32.180000 1 91.660000 1 32.320000 1 83.700000 1 27.140000 1 47.040000 1 36.840000 2 52.360000 1 59.460000 1 28.380000 1 71.400000 1 56.740000 1 94.260000 1 97.360000 1 42.480000 1 75.660000 1 24.040000 1 41.200000 1 46.780000 1 37.340000 1 30.460000 1 51.700000 1 83.520000 1 48.580000 1 61.320000 1 60.740000 1 63.860000 1 30.840000 1 68.120000 1 20.720000 1 88.440000 1 48.620000 1 86.680000 1 64.780000 1 82.420000 1 80.400000 1 28.080000 1 67.980000 1 57.940000 1 59.580000 1 60.300000 1 60.700000 1 17.960000 1 56.560000 1 56.020000 1 66.480000 1 43.440000 1 67.160000 1 66.600000 1 57.280000 1 66.960000 1 54.220000 1 43.100000 1 92.260000 1 71.080000 1 98.360000 1 63.300000 1 60.140000 1 94.220000 1 41.420000 1 89.940000 1 78.420000 1 54.240000 1 56.760000 1 dtype: int64
percentage urbanization rate nan 0.044554 74.500000 0.004950 73.500000 0.004950 67.500000 0.004950 26.460000 0.004950 66.500000 0.004950 87.300000 0.004950 52.040000 0.004950 71.100000 0.004950 85.580000 0.004950 73.480000 0.004950
3) Variable INTERNETUSERATE
internet use rate nan 19 1.400061 1 39.820178 1 74.163040 1 2.199998 1 56.300034 1 9.196775 1 8.370207 1 90.016190 1 69.339971 1 2.699966 1 77.638535 1 13.000111 1 43.055067 1 51.280478 1 9.549931 1 40.772851 1 62.471230 1 2.450362 1
percentage internet use rate nan 0.094059 1.400061 0.004950 39.820178 0.004950 74.163040 0.004950 2.199998 0.004950 56.300034 0.004950 9.196775 0.004950 8.370207 0.004950 90.016190 0.004950 69.339971 0.004950 2.699966 0.004950 77.638535 0.004950 13.000111 0.004950 43.055067 0.004950 51.280478 0.004950 9.549931 0.004950 40.772851 0.004950 62.471230 0.004950
Main stats for these three variables:
data_valid_employrate.describe()
count 169.000000 mean 58.746746 std 10.490075 min 32.000000 25% 51.200001 50% 58.400002 75% 65.000000 max 83.199997
data_valid_urbanrate.describe()
count 193.000000 mean 56.483938 std 23.707742 min 10.400000 25% 36.840000 50% 57.180000 75% 73.920000 max 100.000000
data_valid_interentuserate.describe()
count 183.000000 mean 35.540207 std 27.758392 min 0.210066 25% 9.999254 50% 31.568098 75% 55.646421 max 95.638113
Derived variables (this is an advance of week 3):
We can see that there are no values we want to restrict in our set that may not have been restricted yet (the missing countries or with NA values), as in the class videos example.
We could only try to create a secondary variable that tries to subdivide the urban rate into different tiers or levels of urbanization.
Now, we will proceed to create a secondary derived variable called "URBANIZATIONGROUP" for categorizing the level of urbanization of a country. We will assign 4 groups, corresponding to the 4 quartiles of the variable URBANRATE:
UrbLevel1: Countries having 30 or less percent of population living in urban areas
UrbLevel2: Countries having between 25 and 75 percent of population living in urban areas.
UrbLevel3: Countries having more than 75 percent of population living in urban areas.
These are the counts for each of these three levels in our dataset:
UrbLevel1 31 UrbLevel2 114 UrbLevel3 48
UrbLevel1 0.160622 UrbLevel2 0.590674 UrbLevel3 0.248705
To see if these levels have been assigned correctly, we execute the crosstab
function with these two variables:
print(pandas.crosstab(x2_valid_data_urban['URBANIZATIONGROUP'], x2_valid_data_urban['URBANRATE']))
THIS IS THE PYTHON CODE (I am using Python 2.7):
#convert variables to numeric
data["EMPLOYRATE"]=data["EMPLOYRATE"]
.convert_objects(convert_numeric=True)
data["URBANRATE"]=data["URBANRATE"]
.convert_objects(convert_numeric=True)
data["INTERNETUSERATE"]=data["INTERNETUSERATE"]
.convert_objects(convert_numeric=True)
#get frequency counts (tables):
print("employment rate") c1 = data["EMPLOYRATE"].value_counts(sort=False, dropna=False) print(c1)
#in order to ask for percentages of each value based on those counts: print("percentage employment rate") p1 = data["EMPLOYRATE"].value_counts(sort=False, normalize=True, dropna= False) print (p1)
print("urbanization rate") c2 = data["URBANRATE"].value_counts(sort=False, dropna=False) print(c2) #in order to ask for percentages of each value based on those counts: print("percentage urbanization rate") p2 = data["URBANRATE"].value_counts(sort=False, normalize=True, dropna=False) print (p2)
print("internet use rate") c3 = data["INTERNETUSERATE"].value_counts(sort=False, dropna=False) print(c3) #in order to ask for percentages of each value based on those counts: print("percentage internet use rate") p3 = data["INTERNETUSERATE"].value_counts(sort=False, normalize=True, dropna=False) print (p3)
#get frequency tables using crosstab functions:
freq_table_employrate = pandas.crosstab(index=data["EMPLOYRATE"], columns="count") freq_table_employrate
freq_table_urbanrate = pandas.crosstab(index=data["URBANRATE"], columns="count") freq_table_urbanrate
freq_table_internetuserate = pandas.crosstab(index=data["INTERNETUSERATE"], columns="count") freq_table_internetuserate
freq_table_employrate/freq_table_employrate.sum()
freq_table_urbanrate/freq_table_urbanrate.sum()
freq_table_internetuserate/freq_table_internetuserate.sum()
#Next, we create new variable URBANIZATIONGROUP to categorize countries based on their percentage of urban population print('URBANIZATIONGROUP - 3 categories - custom groups based on thresholds')
x_valid_data_urban.loc[:,('URBANIZATIONGROUP')]= pandas.cut(x_valid_data_urban['URBANRATE'], [0, 29, 74, 100 ], labels=["UrbLevel1", "UrbLevel2", "UrbLevel3"])
x2_valid_data_urban=x_valid_data_urban.copy()
c5 = x2_valid_data_urban['URBANIZATIONGROUP'].value_counts(sort=False, dropna=True) print(c5)
p5 = x2_valid_data_urban['URBANIZATIONGROUP'].value_counts(sort=False, dropna=True, normalize=True) print(p5)
x2_valid_data_urban.describe()