Arpagon Data Analysis @arpagondata - Tumblr Blog

Running a k-means Cluster Analysis

A k-means cluster analysis was conducted to identify underlying subgroups of Countries based on their similarity of on 11 macro-economic variables that represent characteristics that could have an impact on politycscore. Clustering variables included

'internetuserate', 'incomeperperson', 'urbanrate', 'alcconsumption', 'armedforcesrate', 'co2emissions', 'lifeexpectancy', 'oilperperson', 'relectricperperson', 'suicideper100th', 'employrate'

All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.

Data were randomly split into a training set that included 70% of the observations (N=40) and a test set that included 30% of the observations (N=18). A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.

**Figure 1. Elbow curve of r-square values for the nine cluster solutions

The elbow curve was inconclusive, suggesting that the 2, 3 solutions might be interpreted. The results below are for an interpretation of the 3-cluster solution.

Canonical discriminant analyses was used to reduce the 11 clustering variable down a few variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster (Figure 2 shown below)

**Figure 2. Scatterplot of Canonical Variables for 3 Clusters

Cluster Analysis

Cluster 0

The means on the clustering variables showed that, compared to the other clusters, Countries in cluster 1 had moderate levels on the clustering variables. They are misnamed “Developing country” on .

**Cluster 1

cluster 2 had higher levels on the clustering variables compared to cluster 1, Are the misnamed “Heavily indebted poor countries”

Cluster 3

cluster 3 had higher levels on the clustering variables compared to cluster 1, Are the misnamed** “Developed country”**

In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducting to test for significant differences between the clusters on politycscore. A tukey test was used for post hoc comparisons between the clusters.

Results indicated significant differences between the clusters on** **politycscore.

cluster 0 5.000000 1 0.777778 2 8.714286 standard deviations for polityscore by cluster polityscore cluster 0 6.442049 1 7.758508 2 4.810702

Code:

## ## #!/usr/bin/env python3 # vim: set fileencoding=utf-8 : # -*- coding: utf-8 -*- # # week3.py # Copyright 2015 arpagon # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, # MA 02110-1301, USA. """ Running a k-means Cluster Analysis polityscore On dataset of gapminder """ __version__ = "0.0.1" __license__ = """The GNU General Public License (GPL-2.0)""" __author__ = "Sebastian Rojo [email protected]" __contributors__ = [] _debug = 0 _question="Can the Internet use democratizes society?" _hypothesis=''' I am convinced, a deep democracy promotes the welfare of all people actively involved in the democratic process. Some years ago while reading the book Why Nations Fail. Suggested the deep relationship between poverty and lawlessness in the building of democratic institutions. ''' import pandas import numpy as np import statsmodels.formula.api as smf import statsmodels.stats.multicomp as multi import matplotlib.pylab as plt from pandas import Series, DataFrame from sklearn.cross_validation import train_test_split from sklearn import preprocessing from sklearn.cluster import KMeans from sklearn.decomposition import PCA #%%# Data Managment 1 data = pandas.read_csv("../gapminder.csv", low_memory=False) #Preapare Dataset data['polityscore']=pandas.to_numeric(data['polityscore'], errors='coerce') data['internetuserate']=pandas.to_numeric(data['internetuserate'], errors='coerce') data['incomeperperson']=pandas.to_numeric(data['incomeperperson'], errors='coerce') data['urbanrate']=pandas.to_numeric(data['urbanrate'], errors='coerce') data['alcconsumption']=pandas.to_numeric(data['alcconsumption'], errors='coerce') data['armedforcesrate']=pandas.to_numeric(data['armedforcesrate'], errors='coerce') data['co2emissions']=pandas.to_numeric(data['co2emissions'], errors='coerce') data['lifeexpectancy']=pandas.to_numeric(data['lifeexpectancy'], errors='coerce') data['oilperperson']=pandas.to_numeric(data['oilperperson'], errors='coerce') data['relectricperperson']=pandas.to_numeric(data['relectricperperson'], errors='coerce') data['suicideper100th']=pandas.to_numeric(data['suicideper100th'], errors='coerce') data['employrate']=pandas.to_numeric(data['employrate'], errors='coerce') print(''' Chosee the Countrys whit a unperfect democracy as comparation point ''') data['polityscore2']=pandas.cut(data.polityscore, [-10, 9 , 10], labels=[0,1]) data['polityscore2']=pandas.to_numeric(data['polityscore2'], errors='coerce') data_clean = data.dropna() data_clean.dtypes data_clean.describe() print('Dataset gapminder \n') print('Obeservations:', len(data)) #number of observations (rows) print('Variables:', len(data.columns)) # number of variables (columns) print('\n\nTarget Variable chosen for the study are: polityscore') print('Cluster variables for the assignment:', 'internetuserate', 'incomeperperson', 'urbanrate', 'alcconsumption', 'armedforcesrate', 'co2emissions', 'lifeexpectancy', 'oilperperson', 'relectricperperson', 'suicideper100th', 'employrate') #%%# Split into training and testing sets cluster = data_clean[['internetuserate', 'incomeperperson', 'urbanrate', 'alcconsumption', 'armedforcesrate', 'co2emissions', 'lifeexpectancy', 'oilperperson', 'relectricperperson', 'suicideper100th', 'employrate']] print(cluster.describe()) clustervar=cluster.copy() clustervar['internetuserate']=preprocessing.scale(clustervar['internetuserate'].astype('float64')) clustervar['incomeperperson']=preprocessing.scale(clustervar['incomeperperson'].astype('float64')) clustervar['urbanrate']=preprocessing.scale(clustervar['urbanrate'].astype('float64')) clustervar['alcconsumption']=preprocessing.scale(clustervar['alcconsumption'].astype('float64')) clustervar['armedforcesrate']=preprocessing.scale(clustervar['armedforcesrate'].astype('float64')) clustervar['co2emissions']=preprocessing.scale(clustervar['co2emissions'].astype('float64')) clustervar['lifeexpectancy']=preprocessing.scale(clustervar['lifeexpectancy'].astype('float64')) clustervar['oilperperson']=preprocessing.scale(clustervar['oilperperson'].astype('float64')) clustervar['relectricperperson']=preprocessing.scale(clustervar['relectricperperson'].astype('float64')) clustervar['suicideper100th']=preprocessing.scale(clustervar['suicideper100th'].astype('float64')) clustervar['employrate']=preprocessing.scale(clustervar['employrate'].astype('float64')) # split data into train and test sets clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123) print(clus_train.shape) print(clus_test.shape) #%%# k-means cluster analysis for 1-9 clusters from scipy.spatial.distance import cdist clusters=range(1,10) meandist=[] for k in clusters: model=KMeans(n_clusters=k) model.fit(clus_train) clusassign=model.predict(clus_train) meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1)) / clus_train.shape[0]) #%%# Plot average distance from observations from the cluster centroid """ Plot average distance from observations from the cluster centroid to use the Elbow Method to identify number of clusters to choose """ plt.plot(clusters, meandist) plt.xlabel('Number of clusters') plt.ylabel('Average distance') plt.title('Selecting k with the Elbow Method') plt.show() plt.close() #%%# Interpret 3 cluster solution model3=KMeans(n_clusters=3) model3.fit(clus_train) clusassign=model3.predict(clus_train) #%%# plot clusters pca_2 = PCA(2) plot_columns = pca_2.fit_transform(clus_train) plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_,) plt.xlabel('Canonical variable 1') plt.ylabel('Canonical variable 2') plt.title('Scatterplot of Canonical Variables for 3 Clusters') plt.show() plt.close() #%%# merge cluster assignment ''' multiple steps to merge cluster assignment with clustering variables to examine cluster variable means by cluster ''' # create a unique identifier variable from the index for the # cluster training data to merge with the cluster assignment variable clus_train.reset_index(level=0, inplace=True) # create a list that has the new index variable cluslist=list(clus_train['index']) # create a list of cluster assignments labels=list(model3.labels_) # combine index variable list with cluster assignment list into a dictionary newlist=dict(zip(cluslist, labels)) print("Combined index variable list with cluster assignment") print(newlist) # convert newlist dictionary to a dataframe newclus=DataFrame.from_dict(newlist, orient='index') print("New Cluster") print(newclus) # rename the cluster assignment column newclus.columns = ['cluster'] # now do the same for the cluster assignment variable # create a unique identifier variable from the index for the # cluster assignment dataframe # to merge with cluster training data newclus.reset_index(level=0, inplace=True) # merge the cluster assignment dataframe with the cluster training variable dataframe # by the index variable merged_train=pandas.merge(clus_train, newclus, on='index') merged_train.head(n=100) # cluster frequencies merged_train.cluster.value_counts() #%%# calculate clustering variable means by cluster clustergrp = merged_train.groupby('cluster').mean() print ("Clustering variable means by cluster") print(clustergrp) # validate clusters in training data by examining cluster differences in polityscore using Gapinder # first have to merge polityscore with clustering variables and cluster assignment data polityscore_data=data_clean['polityscore'] # split polityscore data into train and test sets polityscore_train, polityscore_test = train_test_split(polityscore_data, test_size=.3, random_state=123) polityscore_train1=pandas.DataFrame(polityscore_train) polityscore_train1.reset_index(level=0, inplace=True) merged_train_all=pandas.merge(polityscore_train1, merged_train, on='index') sub1 = merged_train_all[['polityscore', 'cluster']].dropna() polityscoremod = smf.ols(formula='polityscore ~ C(cluster)', data=sub1).fit() print (polityscoremod.summary()) print ('means for polityscore by cluster') m1= sub1.groupby('cluster').mean() print (m1) print ('standard deviations for polityscore by cluster') m2= sub1.groupby('cluster').std() print (m2) mc1 = multi.MultiComparison(sub1['polityscore'], sub1['cluster']) res1 = mc1.tukeyhsd() print(res1.summary())

Result:

Python 3.4.3+ (default, Oct 14 2015, 16:03:50) Type "copyright", "credits" or "license" for more information. IPython 2.3.0 -- An enhanced Interactive Python. ? -> Introduction and overview of IPython's features. %quickref -> Quick reference. help -> Python's own help system. object? -> Details about 'object', use 'object??' for extra details. %guiref -> A brief reference about the graphical user interface. In [1]: runfile('/home/arpagon/Workspace/DataAnalysisSpecialization/src/Machine Learning/week4.py', wdir='/home/arpagon/Workspace/DataAnalysisSpecialization/src/Machine Learning') Chosee the Countrys whit a unperfect democracy as comparation point Dataset gapminder Obeservations: 213 Variables: 17 Target Variable chosen for the study are: polityscore Cluster variables for the assignment: internetuserate incomeperperson urbanrate alcconsumption armedforcesrate co2emissions lifeexpectancy oilperperson relectricperperson suicideper100th employrate internetuserate incomeperperson urbanrate alcconsumption count 58.000000 58.000000 58.000000 58.000000 mean 51.249216 12444.391557 67.865862 9.307931 std 26.598757 12335.594804 16.466697 5.232932 min 2.199998 558.062877 27.140000 0.050000 25% 32.384640 2498.678807 60.805000 6.192500 50% 46.333146 5869.642345 68.570000 9.870000 75% 77.097878 24657.106188 77.780000 12.945000 max 93.277508 39972.352768 100.000000 19.150000 armedforcesrate co2emissions lifeexpectancy oilperperson count 58.000000 5.800000e+01 58.000000 58.000000 mean 1.260754 1.631592e+10 75.265862 1.275188 std 1.057948 4.610825e+10 5.839142 1.682150 min 0.287892 2.262553e+08 52.797000 0.032281 25% 0.534841 1.890118e+09 73.013250 0.461774 50% 0.943844 3.852409e+09 75.539000 0.867870 75% 1.641310 1.053362e+10 80.521250 1.562843 max 6.394936 3.342209e+11 83.394000 12.228645 relectricperperson suicideper100th employrate count 58.000000 58.000000 58.000000 mean 1543.790545 10.956465 57.536207 std 1887.521465 6.948989 7.398879 min 68.115229 1.380965 41.099998 25% 487.615567 5.920024 52.650000 50% 875.419623 9.993177 58.450001 75% 1858.100384 13.537748 62.099999 max 11154.755033 33.341860 75.199997 (40, 11) (18, 11) /home/arpagon/.local/lib/python3.4/site-packages/sklearn/preprocessing/data.py:167: UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features. warnings.warn("Numerical issues were encountered " /usr/lib/python3/dist-packages/matplotlib/collections.py:571: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison if self._edgecolors == str('face'): Combined index variable list with cluster assignment {96: 0, 139: 2, 197: 1, 6: 0, 72: 0, 201: 2, 202: 2, 11: 0, 205: 1, 207: 0, 144: 2, 152: 1, 146: 1, 84: 0, 86: 1, 25: 0, 88: 1, 153: 0, 154: 0, 90: 2, 69: 2, 159: 0, 32: 2, 16: 0, 100: 0, 9: 2, 39: 0, 178: 1, 174: 0, 136: 2, 50: 2, 179: 0, 54: 0, 55: 1, 184: 2, 185: 2, 124: 0, 10: 2, 190: 1, 63: 2} New Cluster 0 96 0 139 2 197 1 6 0 72 0 201 2 202 2 11 0 205 1 207 0 144 2 152 1 146 1 84 0 86 1 25 0 88 1 153 0 154 0 90 2 69 2 159 0 32 2 16 0 100 0 9 2 39 0 178 1 174 0 136 2 50 2 179 0 54 0 55 1 184 2 185 2 124 0 10 2 190 1 63 2 Clustering variable means by cluster index internetuserate incomeperperson urbanrate cluster 0 97.235294 -0.122710 -0.438748 0.198336 1 144.111111 -1.335951 -0.868176 -1.289793 2 108.142857 1.198233 1.251260 0.625008 alcconsumption armedforcesrate co2emissions lifeexpectancy cluster 0 0.596728 0.225199 -0.205747 -0.051477 1 -1.022763 -0.037247 -0.190892 -1.369818 2 0.240118 -0.541950 -0.067069 0.898356 oilperperson relectricperperson suicideper100th employrate cluster 0 -0.214412 -0.359952 0.114257 -0.077916 1 -0.463567 -0.602981 -0.232547 -0.471509 2 0.450235 1.121878 -0.145282 0.594942 OLS Regression Results ============================================================================== Dep. Variable: polityscore R-squared: 0.194 Model: OLS Adj. R-squared: 0.151 Method: Least Squares F-statistic: 4.460 Date: Thu, 05 May 2016 Prob (F-statistic): 0.0184 Time: 14:48:28 Log-Likelihood: -128.52 No. Observations: 40 AIC: 263.0 Df Residuals: 37 BIC: 268.1 Df Model: 2 Covariance Type: nonrobust =================================================================================== coef std err t P>|t| [95.0% Conf. Int.] ----------------------------------------------------------------------------------- Intercept 5.0000 1.516 3.297 0.002 1.927 8.073 C(cluster)[T.1] -4.2222 2.577 -1.638 0.110 -9.445 1.000 C(cluster)[T.2] 3.7143 2.257 1.646 0.108 -0.858 8.286 ============================================================================== Omnibus: 9.265 Durbin-Watson: 2.187 Prob(Omnibus): 0.010 Jarque-Bera (JB): 8.831 Skew: -1.135 Prob(JB): 0.0121 Kurtosis: 3.381 Cond. No. 3.45 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. means for polityscore by cluster polityscore cluster 0 5.000000 1 0.777778 2 8.714286 standard deviations for polityscore by cluster polityscore cluster 0 6.442049 1 7.758508 2 4.810702 Multiple Comparison of Means - Tukey HSD,FWER=0.05 ============================================== group1 group2 meandiff lower upper reject ---------------------------------------------- 0 1 -4.2222 -10.5143 2.0698 False 0 2 3.7143 -1.7944 9.2229 False 1 2 7.9365 1.4153 14.4578 True ---------------------------------------------- In [2]:

Running a Lasso Regression Analysis for polityscore

A lasso regression analysis was conducted to identify a subset of variables from a pool of 13 quantitative predictor variables that best predicted a quantitative response variable measuring Polityscore in 213 Countries. Categorical predictors included

internetuserate

incomeperperson

urbanrate

alcconsumption

armedforcesrate

co2emissions

femaleemployrate

hivrate

lifeexpectancy

oilperperson

relectricperperson

suicideper100th

employrate

All predictor variables were standardized to have a mean of zero and a standard deviation of one.

Data were randomly split into a training set that included 70% of the observations (N=213) and a test set that included 30% of the observations (N=63). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.

Figure 1 Change in the validation mean square error at each step

Of the 13 predictor variables, 3 were retained in the selected model. During the estimation process, internetuserate were most strongly positive associated with polityscore, and followed by armedforcesrate were most strongly negatively associated with polityscore. Other predictors associated with greater polityscore included incomeperperson,

Figure 2 Regression Coefficients Progression for Lasso Paths

These 3 variables accounted for 40.4% of the variance in the polityscore response variable.

Code

#!/usr/bin/env python3 # vim: set fileencoding=utf-8 : # -*- coding: utf-8 -*- # # week3.py # Copyright 2015 arpagon # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, # MA 02110-1301, USA. """ Running a Lasso Regression Analysis polityscore On dataset of gapminder """ __version__ = "0.0.1" __license__ = """The GNU General Public License (GPL-2.0)""" __author__ = "Sebastian Rojo [email protected]" __contributors__ = [] _debug = 0 _question="Can the Internet use democratizes society?" _hypothesis=''' I am convinced, a deep democracy promotes the welfare of all people actively involved in the democratic process. Some years ago while reading the book Why Nations Fail. Suggested the deep relationship between poverty and lawlessness in the building of democratic institutions. ''' import pandas import numpy as np import matplotlib.pylab as plt from sklearn.cross_validation import train_test_split from sklearn.linear_model import LassoLarsCV from sklearn import preprocessing #%%# Data Managment 1 data = pandas.read_csv("../gapminder.csv", low_memory=False) #Preapare Dataset data['polityscore']=pandas.to_numeric(data['polityscore'], errors='coerce') data['incomeperperson']=pandas.to_numeric(data['incomeperperson'], errors='coerce') data['alcconsumption']=pandas.to_numeric(data['alcconsumption'], errors='coerce') data['armedforcesrate']=pandas.to_numeric(data['armedforcesrate'], errors='coerce') data['breastcancerper100th']=pandas.to_numeric(data['breastcancerper100th'], errors='coerce') data['co2emissions']=pandas.to_numeric(data['co2emissions'], errors='coerce') data['femaleemployrate']=pandas.to_numeric(data['femaleemployrate'], errors='coerce') data['hivrate']=pandas.to_numeric(data['hivrate'], errors='coerce') data['internetuserate']=pandas.to_numeric(data['internetuserate'], errors='coerce') data['lifeexpectancy']=pandas.to_numeric(data['lifeexpectancy'], errors='coerce') data['oilperperson']=pandas.to_numeric(data['oilperperson'], errors='coerce') data['relectricperperson']=pandas.to_numeric(data['relectricperperson'], errors='coerce') data['suicideper100th']=pandas.to_numeric(data['suicideper100th'], errors='coerce') data['employrate']=pandas.to_numeric(data['employrate'], errors='coerce') data['urbanrate']=pandas.to_numeric(data['urbanrate'], errors='coerce') print(''' Chosee the Countrys whit a unperfect democracy as comparation point ''') data['polityscore2']=pandas.cut(data.polityscore, [-10, 9 , 10], labels=[0,1]) data['polityscore2']=pandas.to_numeric(data['polityscore2'], errors='coerce') data_clean = data.dropna() data_clean.dtypes data_clean.describe() print('Dataset gapminder \n') print('Obeservations:', len(data)) #number of observations (rows) print('Variables:', len(data.columns)) # number of variables (columns) print('\n\nTarget Variable chosen for the study are: polityscore') print('Predictors variables for the assignment:', 'internetuserate','incomeperperson','urbanrate', 'alcconsumption', 'armedforcesrate', 'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate', 'lifeexpectancy', 'oilperperson', 'relectricperperson', 'suicideper100th', 'employrate',) #%%# Split into training and testing sets predictors = data_clean[['internetuserate','incomeperperson','urbanrate', 'alcconsumption', 'armedforcesrate', 'co2emissions', 'femaleemployrate', 'hivrate', 'lifeexpectancy', 'oilperperson', 'relectricperperson', 'suicideper100th', 'employrate', ]] target = data_clean.polityscore predictors['internetuserate']=preprocessing.scale(predictors['internetuserate'].astype('float64')) predictors['incomeperperson']=preprocessing.scale(predictors['incomeperperson'].astype('float64')) predictors['urbanrate']=preprocessing.scale(predictors['urbanrate'].astype('float64')) predictors['alcconsumption']=preprocessing.scale(predictors['alcconsumption'].astype('float64')) predictors['armedforcesrate']=preprocessing.scale(predictors['armedforcesrate'].astype('float64')) #predictors['breastcancerper100th']=preprocessing.scale(predictors['breastcancerper100th'].astype('float64')) predictors['co2emissions']=preprocessing.scale(predictors['co2emissions'].astype('float64')) predictors['femaleemployrate']=preprocessing.scale(predictors['femaleemployrate'].astype('float64')) predictors['lifeexpectancy']=preprocessing.scale(predictors['lifeexpectancy'].astype('float64')) predictors['oilperperson']=preprocessing.scale(predictors['oilperperson'].astype('float64')) predictors['relectricperperson']=preprocessing.scale(predictors['relectricperperson'].astype('float64')) predictors['suicideper100th']=preprocessing.scale(predictors['suicideper100th'].astype('float64')) predictors['employrate']=preprocessing.scale(predictors['employrate'].astype('float64')) # split data into train and test sets pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123) print(pred_train.shape) print(pred_test.shape) print(tar_train.shape) print(tar_test.shape) #%%# specify the lasso regression model model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train) # print variable names and regression coefficients print('''variable names and regression coefficients''' ) print(dict(zip(predictors.columns, model.coef_))) #%%# plot coefficient progression m_log_alphas = -np.log10(model.alphas_) ax = plt.gca() plt.plot(m_log_alphas, model.coef_path_.T) plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV') plt.ylabel('Regression Coefficients') plt.xlabel('-log(alpha)') plt.title('Regression Coefficients Progression for Lasso Paths') #%%# plot mean square error for each fold m_log_alphascv = -np.log10(model.cv_alphas_) plt.figure() plt.plot(m_log_alphascv, model.cv_mse_path_, ':') plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2) plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV') plt.legend() plt.xlabel('-log(alpha)') plt.ylabel('Mean squared error') plt.title('Mean squared error on each fold') #%%# Calculatin R-Scaure and MRE ## MSE from training and test data from sklearn.metrics import mean_squared_error train_error = mean_squared_error(tar_train, model.predict(pred_train)) test_error = mean_squared_error(tar_test, model.predict(pred_test)) print ('training data MSE') print(train_error) print ('test data MSE') print(test_error) # R-square from training and test data rsquared_train=model.score(pred_train,tar_train) rsquared_test=model.score(pred_test,tar_test) print ('training data R-square') print(rsquared_train) print ('test data R-square') print(rsquared_test) print(dict(zip(predictors.columns, model.coef_)))

Result

ClassificationTree for binary polityscore

Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable. All possible separations (categorical) or cut points (quantitative) are tested. For the present analyses, the entropy “goodness of split” criterion was used to grow the tree and a cost complexity algorithm was used for pruning the full tree into a final subtree.

The following explanatory variables were included as possible contributors to a classification tree model

X[0] internetuserate X[1] incomeperperson X[2] urbanrate X[3] armedforcesrate X[4] lifeexpectancy X[5] employrate

The total model classified 86% of the sample correctly, 100% of inperct politycscor (sensitivity) and 96% of perferct politicscore (specificity).

[[13 3] [ 0 6]] 0.863636363636

CODE:

#!/usr/bin/env python3 # vim: set fileencoding=utf-8 : # -*- coding: utf-8 -*- # # week4.py # Copyright 2015 arpagon # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, # MA 02110-1301, USA. """ Running a Random Forest polityscore, internetuserate, incomeperperson. On dataset of gapminder """ __version__ = "0.0.1" __license__ = """The GNU General Public License (GPL-2.0)""" __author__ = "Sebastian Rojo [email protected]" __contributors__ = [] _debug = 0 _question="Can the Internet use democratizes society?" _hypothesis=''' I am convinced, a deep democracy promotes the welfare of all people actively involved in the democratic process. Some years ago while reading the book Why Nations Fail. Suggested the deep relationship between poverty and lawlessness in the building of democratic institutions. ''' from pandas import Series, DataFrame import pandas import numpy as np import os import matplotlib.pylab as plt from sklearn.cross_validation import train_test_split from sklearn.tree import DecisionTreeClassifier from sklearn.metrics import classification_report import sklearn.metrics # Feature Importance #Displaying the decision tree from sklearn import tree #from StringIO import StringIO from io import StringIO #from StringIO import StringIO from IPython.display import Image import pydotplus #from sklearn import datasets #from sklearn.ensemble import ExtraTreesClassifier #%%# Data Managment 1 data = pandas.read_csv("../gapminder.csv", low_memory=False) #Preapare Dataset data['polityscore']=pandas.to_numeric(data['polityscore'], errors='coerce') data['incomeperperson']=pandas.to_numeric(data['incomeperperson'], errors='coerce') data['alcconsumption']=pandas.to_numeric(data['alcconsumption'], errors='coerce') data['armedforcesrate']=pandas.to_numeric(data['armedforcesrate'], errors='coerce') data['breastcancerper100th']=pandas.to_numeric(data['breastcancerper100th'], errors='coerce') data['co2emissions']=pandas.to_numeric(data['co2emissions'], errors='coerce') data['femaleemployrate']=pandas.to_numeric(data['femaleemployrate'], errors='coerce') data['hivrate']=pandas.to_numeric(data['hivrate'], errors='coerce') data['internetuserate']=pandas.to_numeric(data['internetuserate'], errors='coerce') data['lifeexpectancy']=pandas.to_numeric(data['lifeexpectancy'], errors='coerce') data['oilperperson']=pandas.to_numeric(data['oilperperson'], errors='coerce') data['relectricperperson']=pandas.to_numeric(data['relectricperperson'], errors='coerce') data['suicideper100th']=pandas.to_numeric(data['suicideper100th'], errors='coerce') data['employrate']=pandas.to_numeric(data['employrate'], errors='coerce') data['urbanrate']=pandas.to_numeric(data['urbanrate'], errors='coerce') print(''' Chosee the Countrys whit a unperfect democracy as comparation point ''') data['polityscore2']=pandas.cut(data.polityscore, [-10, 9 , 10], labels=[0,1]) data['polityscore2']=pandas.to_numeric(data['polityscore2'], errors='coerce') data_clean = data.dropna() data_clean.dtypes data_clean.describe() print('Dataset gapminder \n') print('Obeservations:', len(data)) #number of observations (rows) print('Variables:', len(data.columns)) # number of variables (columns) print('\n\nTarget Variable chosen for the study are: polityscore') print('Predictors variables for the assignment:', 'internetuserate','incomeperperson','urbanrate', 'alcconsumption', 'armedforcesrate', 'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate', 'lifeexpectancy', 'oilperperson', 'relectricperperson', 'suicideper100th', 'employrate',) #%%# Split into training and testing sets predictors = data_clean[['internetuserate','incomeperperson','urbanrate', 'armedforcesrate', 'lifeexpectancy', 'employrate', ]] targets = data_clean.polityscore2 pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4) pred_train.shape pred_test.shape tar_train.shape tar_test.shape #%%# Build model on training data classifier=DecisionTreeClassifier() classifier=classifier.fit(pred_train,tar_train) predictions=classifier.predict(pred_test) print(sklearn.metrics.confusion_matrix(tar_test,predictions)) print(sklearn.metrics.accuracy_score(tar_test, predictions)) out = StringIO() tree.export_graphviz(classifier, out_file=out) graph=pydotplus.graph_from_dot_data(out.getvalue()) Image(graph.create_png())

RESULT:

Chosee the Countrys whit a unperfect democracy as comparation point Dataset gapminder Obeservations: 213 Variables: 17 Target Variable chosen for the study are: polityscore Predictors variables for the assignment: internetuserate incomeperperson urbanrate alcconsumption armedforcesrate breastcancerper100th co2emissions femaleemployrate hivrate lifeexpectancy oilperperson relectricperperson suicideper100th employrate [[7 4] [5 6]] 0.590909090909 In [2]: runfile('/home/arpagon/Workspace/DataAnalysisSpecialization/src/Machine Learning/week1.py', wdir='/home/arpagon/Workspace/DataAnalysisSpecialization/src/Machine Learning') Reloaded modules: sklearn.feature_selection.rfe, sklearn.utils.extmath, sklearn.metrics.cluster.bicluster, sklearn.feature_selection.univariate_selection, sklearn.utils._scipy_sparse_lsqr_backport, sklearn.externals.joblib._multiprocessing_helpers, sklearn.metrics.classification, sklearn.externals.joblib.logger, sklearn.externals.joblib.func_inspect, sklearn.externals.joblib._memory_helpers, sklearn.externals.joblib.format_stack, sklearn.externals, sklearn.metrics.pairwise, sklearn.externals.joblib._compat, sklearn.utils.linear_assignment_, sklearn.metrics, sklearn.utils.lgamma, sklearn.tree._splitter, sklearn.externals.joblib, sklearn.preprocessing._function_transformer, sklearn.preprocessing, sklearn.externals.six, sklearn.metrics.cluster.unsupervised, sklearn.externals.joblib.memory, sklearn.utils.metaestimators, sklearn.externals.joblib.parallel, sklearn.__check_build._check_build, sklearn.preprocessing.imputation, sklearn.utils.sparsefuncs_fast, sklearn.feature_selection.variance_threshold, sklearn.externals.joblib.disk, sklearn.metrics.regression, sklearn.feature_selection.from_model, sklearn.utils.validation, sklearn.externals.joblib.my_exceptions, sklearn.externals.joblib.hashing, sklearn.utils.murmurhash, sklearn.preprocessing.label, sklearn.tree, sklearn.cross_validation, sklearn.utils.stats, sklearn.tree._criterion, sklearn.feature_selection, sklearn.metrics.cluster.supervised, sklearn.feature_selection.base, sklearn.utils.class_weight, sklearn.__check_build, sklearn, sklearn.tree._tree, sklearn.externals.joblib.pool, sklearn.externals.joblib.numpy_pickle, sklearn.preprocessing.data, sklearn.base, sklearn.metrics.cluster.expected_mutual_info_fast, sklearn.metrics.base, sklearn.utils._logistic_sigmoid, sklearn.utils, sklearn.tree.tree, sklearn.utils.sparsefuncs, sklearn.metrics.scorer, sklearn.metrics.ranking, sklearn.utils.fixes, sklearn.metrics.cluster, sklearn.tree._utils, sklearn.utils.multiclass, sklearn.metrics.pairwise_fast, sklearn.tree.export Chosee the Countrys whit a unperfect democracy as comparation point Dataset gapminder Obeservations: 213 Variables: 17 Target Variable chosen for the study are: polityscore Predictors variables for the assignment: internetuserate incomeperperson urbanrate alcconsumption armedforcesrate breastcancerper100th co2emissions femaleemployrate hivrate lifeexpectancy oilperperson relectricperperson suicideper100th employrate [[13 3] [ 0 6]] 0.863636363636

Running a Random Forest for binary polityscore

Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable. The following explanatory variables were included as possible contributors to a random forest evaluating polityscore (my response variable) and the explantory variables next:

internetuserate incomeperperson urbanrate alcconsumption armedforcesrate breastcancerper100th co2emissions femaleemployrate hivrate lifeexpectancy oilperperson relectricperperson suicideper100th employrate

The explanatory variables with the highest relative importance scores were internetuserate and incomeperperson, deviance and grade point average.

Target Variable chosen for the study are: polityscore Predictors variables for the assignment: internetuserate incomeperperson urbanrate alcconsumption armedforcesrate breastcancerper100th co2emissions femaleemployrate hivrate lifeexpectancy oilperperson relectricperperson suicideper100th employrate [[ 8 1] [ 3 10]] 0.818181818182 [ 0.2075289 0.11358815 0.04904959 0.09347312 0.06755082 0.1234881 0.01568182 0.04710498 0.02204712 0.11837124 0.0386859 0.05551033 0.03180944 0.01611049]

The accuracy of the random forest was 81%, with the subsequent growing of multiple trees rather than a single tree, adding little to the overall accuracy of the model, and suggesting that interpretation of a single decision tree may be appropriate.

Code:

#!/usr/bin/env python3 # vim: set fileencoding=utf-8 : # -*- coding: utf-8 -*- # # week4.py # Copyright 2015 arpagon # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, # MA 02110-1301, USA. """ Running a Random Forest polityscore, internetuserate, incomeperperson. On dataset of gapminder """ __version__ = "0.0.1" __license__ = """The GNU General Public License (GPL-2.0)""" __author__ = "Sebastian Rojo [email protected]" __contributors__ = [] _debug = 0 _question="Can the Internet use democratizes society?" _hypothesis=''' I am convinced, a deep democracy promotes the welfare of all people actively involved in the democratic process. Some years ago while reading the book Why Nations Fail. Suggested the deep relationship between poverty and lawlessness in the building of democratic institutions. ''' from pandas import Series, DataFrame import pandas import numpy as np import os import matplotlib.pylab as plt from sklearn.cross_validation import train_test_split from sklearn.tree import DecisionTreeClassifier from sklearn.metrics import classification_report import sklearn.metrics # Feature Importance from sklearn import datasets from sklearn.ensemble import ExtraTreesClassifier #%%# Data Managment 1 data = pandas.read_csv("../gapminder.csv", low_memory=False) #Preapare Dataset data['polityscore']=pandas.to_numeric(data['polityscore'], errors='coerce') data['incomeperperson']=pandas.to_numeric(data['incomeperperson'], errors='coerce') data['alcconsumption']=pandas.to_numeric(data['alcconsumption'], errors='coerce') data['armedforcesrate']=pandas.to_numeric(data['armedforcesrate'], errors='coerce') data['breastcancerper100th']=pandas.to_numeric(data['breastcancerper100th'], errors='coerce') data['co2emissions']=pandas.to_numeric(data['co2emissions'], errors='coerce') data['femaleemployrate']=pandas.to_numeric(data['femaleemployrate'], errors='coerce') data['hivrate']=pandas.to_numeric(data['hivrate'], errors='coerce') data['internetuserate']=pandas.to_numeric(data['internetuserate'], errors='coerce') data['lifeexpectancy']=pandas.to_numeric(data['lifeexpectancy'], errors='coerce') data['oilperperson']=pandas.to_numeric(data['oilperperson'], errors='coerce') data['relectricperperson']=pandas.to_numeric(data['relectricperperson'], errors='coerce') data['suicideper100th']=pandas.to_numeric(data['suicideper100th'], errors='coerce') data['employrate']=pandas.to_numeric(data['employrate'], errors='coerce') data['urbanrate']=pandas.to_numeric(data['urbanrate'], errors='coerce') print(''' Chosee the Countrys whit a unperfect democracy as comparation point ''') data['polityscore2']=pandas.cut(data.polityscore, [-10, 9 , 10], labels=[0,1]) data['polityscore2']=pandas.to_numeric(data['polityscore2'], errors='coerce') data_clean = data.dropna() data_clean.dtypes data_clean.describe() print('Dataset gapminder \n') print('Obeservations:', len(data)) #number of observations (rows) print('Variables:', len(data.columns)) # number of variables (columns) print('\n\nTarget Variable chosen for the study are: polityscore') print('Predictors variables for the assignment:', 'internetuserate','incomeperperson','urbanrate', 'alcconsumption', 'armedforcesrate', 'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate', 'lifeexpectancy', 'oilperperson', 'relectricperperson', 'suicideper100th', 'employrate',) #%%# Split into training and testing sets predictors = data_clean[['internetuserate','incomeperperson','urbanrate', 'alcconsumption', 'armedforcesrate', 'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate', 'lifeexpectancy', 'oilperperson', 'relectricperperson', 'suicideper100th', 'employrate', ]] targets = data_clean.polityscore2 pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4) pred_train.shape pred_test.shape tar_train.shape tar_test.shape #%%# uild model on training data from sklearn.ensemble import RandomForestClassifier classifier=RandomForestClassifier(n_estimators=25) classifier=classifier.fit(pred_train,tar_train) predictions=classifier.predict(pred_test) print(sklearn.metrics.confusion_matrix(tar_test,predictions)) print(sklearn.metrics.accuracy_score(tar_test, predictions)) # fit an Extra Trees model to the data model = ExtraTreesClassifier() model.fit(pred_train,tar_train) # display the relative importance of each attribute print(model.feature_importances_) """ Running a different number of trees and see the effect of that on the accuracy of the prediction """ trees=range(25) accuracy=np.zeros(25) for idx in range(len(trees)): classifier=RandomForestClassifier(n_estimators=idx + 1) classifier=classifier.fit(pred_train,tar_train) predictions=classifier.predict(pred_test) accuracy[idx]=sklearn.metrics.accuracy_score(tar_test, predictions) plt.cla() plt.plot(trees, accuracy)

Result:

Can the Internet use democratizes society? Test a Logistic Regression Model

Question: Can the Internet use democratizes society? hypothesis: I am convinced, a deep democracy promotes the welfare of all people actively involved in the democratic process. Some years ago while reading the book Why Nations Fail. Suggested the deep relationship between poverty and lawlessness in the building of democratic institutions.

Code:

#!/usr/bin/env python # vim: set fileencoding=utf-8 : # -*- coding: utf-8 -*- # # week4.py # Copyright 2015 arpagon # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, # MA 02110-1301, USA. """ Test a Logistic Regression Mode polityscore, internetuserate, incomeperperson. On dataset of gapminder """ __version__ = "0.0.1" __license__ = """The GNU General Public License (GPL-2.0)""" __author__ = "Sebastian Rojo [email protected]" __contributors__ = [] _debug = 0 _question="Can the Internet use democratizes society?" _hypothesis=''' I am convinced, a deep democracy promotes the welfare of all people actively involved in the democratic process. Some years ago while reading the book Why Nations Fail. Suggested the deep relationship between poverty and lawlessness in the building of democratic institutions. ''' import pandas import numpy import seaborn as sns import scipy import matplotlib.pyplot as plt import statsmodels.api as sm import statsmodels.formula.api as smf data = pandas.read_csv("../gapminder.csv", low_memory=False) print('Dataset gapminder \n') print('Obeservations:', len(data)) #number of observations (rows) print('Variables:', len(data.columns)) # number of variables (columns) print('\n\nVariables chosen for the study are: internetuserate, polityscore') print('Additional variables for the assignment: urbanrate, incomeperperson') #%%# Data Managment 1 #Preapare Dataset data['urbanrate']=pandas.to_numeric(data['urbanrate'], errors='coerce') data['internetuserate']=pandas.to_numeric(data['internetuserate'], errors='coerce') data['polityscore']=pandas.to_numeric(data['polityscore'], errors='coerce') data['internetuserate'] = data['internetuserate'].convert_objects( convert_numeric=True) data['polityscore'] = data['polityscore'].convert_objects( convert_numeric=True) data['incomeperperson'] = data['incomeperperson'].convert_objects( convert_numeric=True) data['urbanrate'] = data['urbanrate'].convert_objects( convert_numeric=True) data['internetuserate']=data['internetuserate'].replace(' ', numpy.nan) data['polityscore']=data['polityscore'].replace(' ', numpy.nan) data['incomeperperson']=data['incomeperperson'].replace(' ', numpy.nan) data['urbanrate']=data['urbanrate'].replace(' ', numpy.nan) sub1=data[['country', 'internetuserate', 'polityscore', 'incomeperperson', 'urbanrate']].dropna() #%%# ppolityscore2 bin it into two categories. (Data Managment 2) print(''' Chosee the Countrys whit a unperfect democracy as comparation point ''') sub1['polityscore2']=pandas.cut(sub1.polityscore, [-10, 7 , 10], labels=[0,1]) sub1['polityscore2']=pandas.to_numeric(sub1['polityscore2'], errors='coerce') sub1['incomeperpersongroup4']=pandas.qcut(sub1.urbanrate, 4, labels=[0, 1, 2 , 3]) sub1['incomeperpersongroup2']=pandas.qcut(sub1.urbanrate, 2, labels=[0, 1]) sub1=sub1[['country', 'internetuserate', 'polityscore', 'incomeperperson', 'urbanrate', 'polityscore2', 'incomeperpersongroup4', 'incomeperpersongroup2']].dropna() #%%# center quantitative IVs for regression analysis sub1['internetuserate_c'] = (sub1['internetuserate'] - sub1['internetuserate'].mean()) sub1['incomeperperson_c'] = (sub1['incomeperperson'] - sub1['incomeperperson'].mean()) sub1['urbanrate_c'] = (sub1['urbanrate'] - sub1['urbanrate'].mean()) print(sub1[["internetuserate_c","incomeperperson_c", "urbanrate_c", "incomeperpersongroup4"]].describe()) # can override the default ad specify a different reference group # non-Hispanic White as reference group reg6 = smf.ols('polityscore ~ internetuserate_c + urbanrate_c + \ C(incomeperpersongroup4, Treatment(reference=3))', data=sub1).fit() print (reg6.summary()) print(sub1['polityscore2'].value_counts(sort=False, dropna=False)) #%%# LOGISTIC REGRESSION # logistic regression with social phobia lreg1 = smf.logit(formula = 'polityscore2 ~ internetuserate', data = sub1).fit() print (lreg1.summary()) # odds ratios print ("Odds Ratios") print (numpy.exp(lreg1.params)) # odd ratios with 95% confidence intervals params = lreg1.params conf = lreg1.conf_int() conf['OR'] = params conf.columns = ['Lower CI', 'Upper CI', 'OR'] print (numpy.exp(conf)) # logistic regression with social phobia lreg1 = smf.logit(formula = 'polityscore2 ~ internetuserate + urbanrate', data = sub1).fit() print (lreg1.summary()) # odds ratios print ("Odds Ratios") print (numpy.exp(lreg1.params)) # odd ratios with 95% confidence intervals params = lreg1.params conf = lreg1.conf_int() conf['OR'] = params conf.columns = ['Lower CI', 'Upper CI', 'OR'] print (numpy.exp(conf))

Result:

runfile('/home/arpagon/Workspace/DataAnalysisSpecialization/src/Regression Modeling in Practice/week4.py', wdir='/home/arpagon/Workspace/DataAnalysisSpecialization/src/Regression Modeling in Practice') Dataset gapminder Obeservations: 213 Variables: 16 Variables chosen for the study are: internetuserate, polityscore Additional variables for the assignment: urbanrate, incomeperperson Chosee the Countrys whit a unperfect democracy as comparation point internetuserate_c incomeperperson_c urbanrate_c count 1.510000e+02 1.510000e+02 151.000000 mean -7.717153e-15 1.445554e-13 0.000000 std 2.743633e+01 9.598839e+03 22.285121 min -3.209319e+01 -6.383374e+03 -44.450596 25% -2.415311e+01 -5.893678e+03 -18.300596 50% -5.826034e+00 -4.265964e+03 2.329404 75% 1.929407e+01 -3.002972e+01 16.399404 max 6.097425e+01 3.348520e+04 45.149404 OLS Regression Results ============================================================================== Dep. Variable: polityscore R-squared: 0.197 Model: OLS Adj. R-squared: 0.170 Method: Least Squares F-statistic: 7.128 Date: Wed, 03 Feb 2016 Prob (F-statistic): 5.45e-06 Time: 23:18:55 Log-Likelihood: -468.70 No. Observations: 151 AIC: 949.4 Df Residuals: 145 BIC: 967.5 Df Model: 5 Covariance Type: nonrobust ========================================================================================================================= coef std err t P>|t| [95.0% Conf. Int.] ------------------------------------------------------------------------------------------------------------------------- Intercept 4.4448 2.156 2.062 0.041 0.184 8.706 C(incomeperpersongroup4, Treatment(reference=3))[T.0] -1.6970 4.198 -0.404 0.687 -9.995 6.601 C(incomeperpersongroup4, Treatment(reference=3))[T.1] -0.9596 2.737 -0.351 0.726 -6.370 4.451 C(incomeperpersongroup4, Treatment(reference=3))[T.2] 1.4337 1.774 0.808 0.420 -2.073 4.941 internetuserate_c 0.1011 0.022 4.542 0.000 0.057 0.145 urbanrate_c -0.0561 0.070 -0.801 0.424 -0.194 0.082 ============================================================================== Omnibus: 19.595 Durbin-Watson: 1.953 Prob(Omnibus): 0.000 Jarque-Bera (JB): 23.774 Skew: -0.970 Prob(JB): 6.88e-06 Kurtosis: 3.135 Cond. No. 400. ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. 0 85 1 66 Name: polityscore2, dtype: int64 Optimization terminated successfully. Current function value: 0.488711 Iterations 6 Logit Regression Results ============================================================================== Dep. Variable: polityscore2 No. Observations: 151 Model: Logit Df Residuals: 149 Method: MLE Df Model: 1 Date: Wed, 03 Feb 2016 Pseudo R-squ.: 0.2868 Time: 23:18:55 Log-Likelihood: -73.795 converged: True LL-Null: -103.47 LLR p-value: 1.325e-14 =================================================================================== coef std err z P>|z| [95.0% Conf. Int.] ----------------------------------------------------------------------------------- Intercept -2.0974 0.351 -5.979 0.000 -2.785 -1.410 internetuserate 0.0571 0.009 6.189 0.000 0.039 0.075 =================================================================================== Odds Ratios Intercept 0.122776 internetuserate 1.058719 dtype: float64 Lower CI Upper CI OR Intercept 0.061735 0.244170 0.122776 internetuserate 1.039759 1.078024 1.058719 Optimization terminated successfully. Current function value: 0.488463 Iterations 6 Logit Regression Results ============================================================================== Dep. Variable: polityscore2 No. Observations: 151 Model: Logit Df Residuals: 148 Method: MLE Df Model: 2 Date: Wed, 03 Feb 2016 Pseudo R-squ.: 0.2871 Time: 23:18:55 Log-Likelihood: -73.758 converged: True LL-Null: -103.47 LLR p-value: 1.252e-13 =================================================================================== coef std err z P>|z| [95.0% Conf. Int.] ----------------------------------------------------------------------------------- Intercept -2.2224 0.580 -3.833 0.000 -3.359 -1.086 internetuserate 0.0553 0.011 4.909 0.000 0.033 0.077 urbanrate 0.0033 0.012 0.274 0.784 -0.020 0.027 =================================================================================== Odds Ratios Intercept 0.108346 internetuserate 1.056811 urbanrate 1.003261 dtype: float64 Lower CI Upper CI OR Intercept 0.034778 0.337544 0.108346 internetuserate 1.033749 1.080386 1.056811 urbanrate 0.980150 1.026918 1.003261

Note About Data Management

My response variable is quantitative, so I had to bin it into 2 groups for the purpose of this exercise. I binned it into imperfect Democracy (-10 to 7) and Democratic countries.

Conclusion

All of my explanatory variables were found to have a significant positive influence on breast cancer rates:

internetuserate (Beta=0.055, P<0.001)

urban rate (Beta=0.003, P=0.78)

Although these variables all support the alternate hypothesis that they do have a statistically significant effect on Politic Score rates, all of these associations appear to be weak, based on the low Beta values. The effect of my primary explanatory variable (internetuserate).

internet use rate 95% CI = 1.033749 - 1.080386

urban rate 95% CI = 0.980150 - 1.026918

When I examined each of these variables in isolation, internet use rate appears to be the most significant variable, having a more significant negative impact on the odds ratios of the others. Again, I feel logistic regression is not necessarily the most useful method for investigating my research question.

Can the Internet use democratizes society? Test a Multiple Regression Model

Code:

#!/usr/bin/env python # vim: set fileencoding=utf-8 : # -*- coding: utf-8 -*- # # week3.py # Copyright 2015 arpagon # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, # MA 02110-1301, USA. """ Test a Basic Linear Regression Model polityscore, internetuserate, incomeperperson. On dataset of gapminder """ __version__ = "0.0.1" __license__ = """The GNU General Public License (GPL-2.0)""" __author__ = "Sebastian Rojo [email protected]" __contributors__ = [] _debug = 0 _question="Can the Internet use democratizes society?" _hypothesis=''' I am convinced, a deep democracy promotes the welfare of all people actively involved in the democratic process. Some years ago while reading the book Why Nations Fail. Suggested the deep relationship between poverty and lawlessness in the building of democratic institutions. ''' import pandas import numpy import seaborn as sns import scipy import matplotlib.pyplot as plt import statsmodels.api as sm import statsmodels.formula.api as smf data = pandas.read_csv("../gapminder.csv", low_memory=False) print('Dataset gapminder \n') print('Obeservations:', len(data)) #number of observations (rows) print('Variables:', len(data.columns)) # number of variables (columns) print('\n\nVariables chosen for the study are: internetuserate, polityscore') print('Additional variables for the assignment: urbanrate, incomeperperson') #Preapare Dataset data['internetuserate'] = data['internetuserate'].convert_objects( convert_numeric=True) data['polityscore'] = data['polityscore'].convert_objects( convert_numeric=True) data['incomeperperson'] = data['incomeperperson'].convert_objects( convert_numeric=True) data['urbanrate'] = data['urbanrate'].convert_objects( convert_numeric=True) data['internetuserate']=data['internetuserate'].replace(' ', numpy.nan) data['polityscore']=data['polityscore'].replace(' ', numpy.nan) data['incomeperperson']=data['incomeperperson'].replace(' ', numpy.nan) data['urbanrate']=data['urbanrate'].replace(' ', numpy.nan) sub1=data[['country', 'internetuserate', 'polityscore', 'incomeperperson', 'urbanrate']].dropna() #%%# POLYNOMIAL REGRESSION #basic scatterplot: Q->Q plt.figure() scat1 = sns.regplot(x="internetuserate", y="polityscore", data=sub1) plt.xlabel('Internet Use Rate') plt.ylabel('Overall polity score') plt.title('Scatterplot for the Association Between Internet Use Rate and' \ 'Overall polity score') plt.show() plt.close() # fit second order polynomial # run the 2 scatterplots together to get both linear and second order fit lines plt.figure() scat1 = sns.regplot(x="internetuserate", y="polityscore", order=2, data=sub1) plt.xlabel('Internet Use Rate') plt.ylabel('Overall polity score') plt.title('Scatterplot for the Association Between Internet Use Rate and' \ 'Overall polity score') plt.show() plt.close() # center quantitative IVs for regression analysis sub1['internetuserate_c'] = (sub1['internetuserate'] - sub1['internetuserate'].mean()) sub1['incomeperperson_c'] = (sub1['incomeperperson'] - sub1['incomeperperson'].mean()) sub1['urbanrate_c'] = (sub1['urbanrate'] - sub1['urbanrate'].mean()) print(sub1[["internetuserate_c","incomeperperson_c", "urbanrate_c"]].describe()) # linear regression analysis reg1 = smf.ols('polityscore ~ internetuserate_c', data=sub1).fit() print (reg1.summary()) # quadratic (polynomial) regression analysis reg2 = smf.ols('polityscore ~ internetuserate_c + I(internetuserate_c**2)', data=sub1).fit() print (reg2.summary()) #%%# EVALUATING MODEL FIT print("Dicard Cuadratic internetuserate_c not fit in Model") # adding internet use rate reg3 = smf.ols('polityscore ~ internetuserate_c + urbanrate_c + incomeperperson_c', data=sub1).fit() print (reg3.summary()) #Q-Q plot for normality fig4=sm.qqplot(reg3.resid, line='r') # simple plot of residuals plt.figure() stdres=pandas.DataFrame(reg3.resid_pearson) plt.plot(stdres, 'o', ls='None') l = plt.axhline(y=0, color='r') plt.ylabel('Standardized Residual') plt.xlabel('Observation Number') plt.show() plt.close() # additional regression diagnostic plots plt.figure() fig2 = plt.figure(figsize=(12,8)) fig2 = sm.graphics.plot_regress_exog(reg3, "urbanrate_c", fig=fig2) plt.show() plt.close() # leverage plot fig3=sm.graphics.influence_plot(reg3, size=8) print(fig3)

Results:

runfile('/home/arpagon/Workspace/DataAnalysisSpecialization/src/Regression Modeling in Practice/week3.py', wdir='/home/arpagon/Workspace/DataAnalysisSpecialization/src/Regression Modeling in Practice') Dataset gapminder Obeservations: 213 Variables: 16 Variables chosen for the study are: internetuserate, polityscore Additional variables for the assignment: urbanrate, incomeperperson internetuserate_c incomeperperson_c urbanrate_c count 1.530000e+02 1.530000e+02 1.530000e+02 mean -2.461357e-15 3.299147e-12 9.845429e-15 std 2.755326e+01 9.792742e+03 2.249083e+01 min -3.247217e+01 -6.581954e+03 -4.489739e+01 25% -2.431203e+01 -6.089856e+03 -1.847739e+01 50% -5.942212e+00 -4.463395e+03 1.982614e+00 75% 1.923195e+01 6.088228e+01 1.632261e+01 max 6.059527e+01 3.328662e+04 4.470261e+01 OLS Regression Results ============================================================================== Dep. Variable: polityscore R-squared: 0.135 Model: OLS Adj. R-squared: 0.130 Method: Least Squares F-statistic: 23.64 Date: Mon, 25 Jan 2016 Prob (F-statistic): 2.88e-06 Time: 23:16:00 Log-Likelihood: -484.91 No. Observations: 153 AIC: 973.8 Df Residuals: 151 BIC: 979.9 Df Model: 1 Covariance Type: nonrobust ===================================================================================== coef std err t P>|t| [95.0% Conf. Int.] ------------------------------------------------------------------------------------- Intercept 3.9412 0.468 8.413 0.000 3.016 4.867 internetuserate_c 0.0829 0.017 4.862 0.000 0.049 0.117 ============================================================================== Omnibus: 23.318 Durbin-Watson: 2.010 Prob(Omnibus): 0.000 Jarque-Bera (JB): 29.690 Skew: -1.075 Prob(JB): 3.57e-07 Kurtosis: 3.193 Cond. No. 27.5 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. OLS Regression Results ============================================================================== Dep. Variable: polityscore R-squared: 0.136 Model: OLS Adj. R-squared: 0.124 Method: Least Squares F-statistic: 11.81 Date: Mon, 25 Jan 2016 Prob (F-statistic): 1.73e-05 Time: 23:16:00 Log-Likelihood: -484.85 No. Observations: 153 AIC: 975.7 Df Residuals: 150 BIC: 984.8 Df Model: 2 Covariance Type: nonrobust ============================================================================================= coef std err t P>|t| [95.0% Conf. Int.] --------------------------------------------------------------------------------------------- Intercept 3.7579 0.726 5.177 0.000 2.324 5.192 internetuserate_c 0.0789 0.021 3.770 0.000 0.038 0.120 I(internetuserate_c ** 2) 0.0002 0.001 0.331 0.741 -0.001 0.002 ============================================================================== Omnibus: 22.801 Durbin-Watson: 2.015 Prob(Omnibus): 0.000 Jarque-Bera (JB): 28.856 Skew: -1.060 Prob(JB): 5.42e-07 Kurtosis: 3.178 Cond. No. 1.68e+03 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.68e+03. This might indicate that there are strong multicollinearity or other numerical problems. Dicard Cuadratic internetuserate_c not fit in Model OLS Regression Results ============================================================================== Dep. Variable: polityscore R-squared: 0.144 Model: OLS Adj. R-squared: 0.127 Method: Least Squares F-statistic: 8.373 Date: Mon, 25 Jan 2016 Prob (F-statistic): 3.51e-05 Time: 23:16:00 Log-Likelihood: -484.12 No. Observations: 153 AIC: 976.2 Df Residuals: 149 BIC: 988.4 Df Model: 3 Covariance Type: nonrobust ===================================================================================== coef std err t P>|t| [95.0% Conf. Int.] ------------------------------------------------------------------------------------- Intercept 3.9412 0.469 8.400 0.000 3.014 4.868 internetuserate_c 0.1078 0.032 3.368 0.001 0.045 0.171 urbanrate_c -0.0336 0.028 -1.186 0.237 -0.090 0.022 incomeperperson_c -2.223e-05 8.19e-05 -0.271 0.786 -0.000 0.000 ============================================================================== Omnibus: 21.852 Durbin-Watson: 1.975 Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.386 Skew: -1.034 Prob(JB): 1.13e-06 Kurtosis: 3.137 Cond. No. 9.76e+03 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 9.76e+03. This might indicate that there are strong multicollinearity or other numerical problems. Figure(640x440)

a) q-q plot

b) standardized residuals for all observations

c) leverage plot

additional regression diagnostic plots

Analisys

After adjusting for potential confounding factors, Politic Score (Beta=0.136, p=.0001) was significantly and positively associated with Internet Use Rate. When adding additional variables like urbanrate_c and incomeperperson_c increase the percentage of predicted values (Beta=0.144, p=.001)

Whatever this mean a weak relation between Politic Score and Internet Use Rate. When adding possible cofunding variables this variables get no acceptables P values.

Adding additional variable like urbanrate_c and incomeperperson_c This variavbles is not significantly associated with Politic Score.

OLS Analysis.

OLS Regression Results ============================================================================== Dep. Variable: polityscore R-squared: 0.144 Model: OLS Adj. R-squared: 0.127 Method: Least Squares F-statistic: 8.373 Date: Mon, 25 Jan 2016 Prob (F-statistic): 3.51e-05 Time: 23:16:00 Log-Likelihood: -484.12 No. Observations: 153 AIC: 976.2 Df Residuals: 149 BIC: 988.4 Df Model: 3 Covariance Type: nonrobust ===================================================================================== coef std err t P>|t| [95.0% Conf. Int.] ------------------------------------------------------------------------------------- Intercept 3.9412 0.469 8.400 0.000 3.014 4.868 internetuserate_c 0.1078 0.032 3.368 0.001 0.045 0.171 urbanrate_c -0.0336 0.028 -1.186 0.237 -0.090 0.022 incomeperperson_c -2.223e-05 8.19e-05 -0.271 0.786 -0.000 0.000 ============================================================================== Omnibus: 21.852 Durbin-Watson: 1.975 Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.386 Skew: -1.034 Prob(JB): 1.13e-06 Kurtosis: 3.137 Cond. No. 9.76e+03 ==============================================================================

Can the Internet use democratizes society? Test a Basic Linear Regression Model

Code:

Result:

runfile('/home/arpagon/Workspace/DataAnalysisSpecialization/src/Regression Modeling in Practice/week2.py', wdir='/home/arpagon/Workspace/DataAnalysisSpecialization/src/Regression Modeling in Practice') Dataset gapminder Obeservations: 213 Variables: 16 Variables chosen for the study are: internetuserate, polityscore Additional variables for the assignment: urbanrate, incomeperperson 32.530946372 -1.36149156465e-14 OLS regression model for the association between urban rate and internet use rate OLS Regression Results ================================================================================= Dep. Variable: internetuseratecenter R-squared: 0.133 Model: OLS Adj. R-squared: 0.127 Method: Least Squares F-statistic: 23.42 Date: Sun, 17 Jan 2016 Prob (F-statistic): 3.15e-06 Time: 22:05:25 Log-Likelihood: -721.96 No. Observations: 155 AIC: 1448. Df Residuals: 153 BIC: 1454. Df Model: 1 Covariance Type: nonrobust =============================================================================== coef std err t P>|t| [95.0% Conf. Int.] ------------------------------------------------------------------------------- Intercept -6.1686 2.424 -2.545 0.012 -10.958 -1.380 polityscore 1.6043 0.331 4.840 0.000 0.949 2.259 ============================================================================== Omnibus: 12.405 Durbin-Watson: 1.565 Prob(Omnibus): 0.002 Jarque-Bera (JB): 9.960 Skew: 0.523 Prob(JB): 0.00687 Kurtosis: 2.330 Cond. No. 8.64 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Histogram Internet Use Rate:

Histogram Internet Use Rate Center:

Scatterplot Internet Use ~ Polytic score:

Result:

The results of the linear regression model indicated that Internet Use Rate (Beta=0.133, p=.0001). was weak significantly and positively associated with Politic score.

politcscore = -6.1686 + 1.6043 * (”center internet Use”)

About Gapminder Dataset

Description about gapminder:

Gapminder is a non-profit venture promoting sustainable global development and achievement of the United Nations Millennium Development Goals by increased use and understanding of statistics and other information about social, economic and environmental development at local, national and global levels.

Sample:

URL: http://www.gapminder.org/data/

The sample data from the GapMinder data set includes country-level indicators of health, wealth and development from 260 countries.

The number of countries and territories to include is arbitrary, but we gapminder decided to include the following entities:

192 UN members (as of April 2008)

51 other entities listed in the “List of countries” in Wikipedia (2008-05-13). These include the Vatican, dependent territories, special entities and disputed territories. We have excluded the two “sub-dependencies” Ascension Island and Tristan da Cunha, although they are listed by Wikipedia.

4 French overseas territories (Guadeloupe, Martinique, Reunion and French Guyana), although they are considered an integral part of France

10 former states

2 ad-hoc areas: “Serbia excluding Kosovo” and “the Channel Islands”. The latter is the collective name of the two dependent territories Guernsey and Jersey.

UPDATE: In spetember 2012 we added the new UN state “South Sudan” to our country list. The new number of UN members are hence 193.

This gives a total of 193+51+4+10+2=260 countries and territories. Our goal is to have data for all these entities for at least two indicators (one of them being population), from 1800 onwards. However, most indicators will only have data for a selection of these entities.

All data are grouped at the country level. Data fields used to analyze the relationship between

Explanatory Variables: Internet users (per 100 people)

Response Variable: Democracy score (based on Polity IV)

Procedures:

The values of the study were not controlled by the researcher and in my study, the explanatory variable has not been manipulated. No specific study was performed for obtaining the Gapminder data. Data reporting was used to get the observational data for Gapminder. The data I used were obtained from different institutions:

Polity IV project

World Bank

The data were collected to increase the use and understanding of statistics about social, economic, and environmental development at local, national, and global levels.

variables and measured:

Internet users (per 100 people)

URL: http://data.worldbank.org/indicator/IT.NET.USER.P2

Internet users are individuals who have used the Internet (from any location) in the last 12 months. Internet can be used via a computer, mobile phone, personal digital assistant, games machine, digital TV etc.

Democracy score (based on Polity IV)

URL: http://www.systemicpeace.org/polity/polity4.htm

Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. It is a summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest.

Notes: Gapminder used the variable "polity2". Polity2 is the same as "polity" with the following main differences: anarchy or interregnum has been coded as 0 (in "polity" it was coded as "-77"). Transitions were interpolated, when possible (in "polity" it was coded as "-88"). Foreign interuptions were left blank (in "polity" it was coded as "-66").

Gapminder modifications 1: Countries that emerged when older countries broke up (e.g. former USSR republics): When an area was part of an older unit (and hence had no rating) we used the rating for that older unit. We did this to avoid gaps in the data, which the program would in such case just interpolate (e.g. Estonia had availabeldata both from before the USSR era and after. If the USSR era would be left blank an interpolation would be displayed in the graph for the USSR years).

Gapminder modification 2: Some countries that were created when older countries merged (e.g. Germany from East and West germany): We used the rating for the biggest unit for the relevant years.We did this to avoid gaps in the data, which the program would in such case just interpolate. We did this only when it was absolutely necessary to avoid such gaps.

Gapminder modification 3: Years with "foreign interruptions" (coded as "-66" in polity and left blank in polity 2). We had to fill them to avoid simple interpolation by the program. We did guestimates based on a quick reading of the history of the event (e.g. from Wikipedia) - so the decisions are certainly open to discussion. We applied some rule of thumbs: (1) occupations by highly undemocratic countries (e.g. nazi germany) were assigned the lowest value (-10), although the nature of the occupation might have varied. (2) occupation of a democratic regime in a undemocratic one, that ended with a democracy (e.g. allie occupation of germany) were interpolated. (3) occupation in undemocratic countries by not fully democratic countries and where the effect on democracy after the intervention is unclear. We often either interpolated or reatined the ratings just before the intervention (however we would do it would in any case imply a low democracy score.

Can the Internet use democratizes society? Testing Income per person as Potential Moderate variable.

Original Question: Can the Internet use democratizes society? Original hypothesis: I am convinced, a deep democracy promotes the welfare of all people actively involved in the democratic process. Some years ago while reading the book Why Nations Fail. Suggested the deep relationship between poverty and lawlessness in the building of democratic institutions.

Moderate Question: the income per person variable moderate the relationship by Internetuserate whit Poltic Score?

Code:

#!/usr/bin/env python # vim: set fileencoding=utf-8 : # -*- coding: utf-8 -*- # # week4.py # Copyright 2015 arpagon # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, # MA 02110-1301, USA. """ Perform analysis of variance for variables internetuserate, polityscore urbanrate, incomeperperson. On dataset of gapminder """ __version__ = "0.0.1" __license__ = """The GNU General Public License (GPL-2.0)""" __author__ = "Sebastian Rojo [email protected]" __contributors__ = [] _debug = 0 _question="Can the Internet use democratizes society?" _hypothesis=''' I am convinced, a deep democracy promotes the welfare of all people actively involved in the democratic process. Some years ago while reading the book Why Nations Fail. Suggested the deep relationship between poverty and lawlessness in the building of democratic institutions. ''' import pandas import numpy import seaborn as sns import scipy import matplotlib.pyplot as plt data = pandas.read_csv("../gapminder.csv", low_memory=False) print('Dataset gapminder \n') print('Obeservations:', len(data)) #number of observations (rows) print('Variables:', len(data.columns)) # number of variables (columns) print('\n\nVariables chosen for the study are: internetuserate, polityscore') print('Additional variables for the assignment: urbanrate, incomeperperson') #internetuserate data['internetuserate'] = data['internetuserate'].convert_objects( convert_numeric=True) data['polityscore'] = data['polityscore'].convert_objects( convert_numeric=True) data['incomeperperson'] = data['incomeperperson'].convert_objects( convert_numeric=True) data['internetuserate']=data['internetuserate'].replace(' ', numpy.nan) data['polityscore']=data['polityscore'].replace(' ', numpy.nan) data['incomeperperson']=data['incomeperperson'].replace(' ', numpy.nan) sub1=data[['country','internetuserate', 'polityscore','incomeperperson']].dropna() data_potential_moderator = sub1.copy() #basic scatterplot: Q->Q plt.figure() scat1 = sns.regplot(x="internetuserate", y="polityscore", data=data_potential_moderator) plt.xlabel('Internet Use Rate') plt.ylabel('Overall polity score') plt.title('Scatterplot for the Association Between Internet Use Rate and' \ 'Overall polity score') plt.show() plt.close() print ('association between internetuserat and polityscore') print (scipy.stats.pearsonr(data_potential_moderator['internetuserate'], data_potential_moderator['polityscore'])) #%% Incomperson as Potential Modarator def incomegrp (row): if row['incomeperperson'] <= 744.239: return 1 elif row['incomeperperson'] <= 9425.326 : return 2 elif row['incomeperperson'] > 9425.326: return 3 data_potential_moderator['incomegrp'] = data_potential_moderator.apply (lambda row: incomegrp (row),axis=1) chk1 = data_potential_moderator['incomegrp'].value_counts(sort=False, dropna=False) print("Income Groups 1 = LOW INCOME, 2 = MEDIUM INCOME, 3=HIGT INCOME") print(chk1) sub1=data_potential_moderator[(data_potential_moderator['incomegrp']== 1)] sub2=data_potential_moderator[(data_potential_moderator['incomegrp']== 2)] sub3=data_potential_moderator[(data_potential_moderator['incomegrp']== 3)] #%% LOW Incomperson as Potential Modarator plt.figure() scat1 = sns.regplot(x="internetuserate", y="polityscore", data=sub1) plt.xlabel('Urban Rate') plt.ylabel('Internet Use Rate') plt.title('Scatterplot for the Association Between Internet Use Rate and Politic Score for LOW income countries') plt.show() plt.close() print ('association between urbanrate and internetuserate for LOW income countries') print (scipy.stats.pearsonr(sub1['internetuserate'], sub1['polityscore'])) #%% Middle Incomperson as Potential Modarator plt.figure() scat2 = sns.regplot(x="internetuserate", y="polityscore", data=sub2) plt.xlabel('Internet Use Rate') plt.ylabel('Politic Score') plt.title('Scatterplot for the Association Between Internet Use Rate and Politic Score for MIDDLE income countries') plt.show() plt.close() print ('association between urbanrate and internetuserate for MIDDLE income countries') print (scipy.stats.pearsonr(sub2['internetuserate'], sub2['polityscore'])) #%% High Incomperson as Potential Modarator plt.figure() scat3 = sns.regplot(x="internetuserate", y="polityscore", data=sub3) plt.xlabel('Internet Use Rate') plt.ylabel('Politic Score') plt.title('Scatterplot for the Association Between Internet Use Rate and Politic Score for HIGH income countries') plt.show() plt.close() print ('association between urbanrate and internetuserate for HIGH income countries') print (scipy.stats.pearsonr(sub3['internetuserate'], sub3['polityscore']))

Result:

runfile('/home/arpagon/Workspace/DataAnalysisSpecialization/src/Data Analysis Tools/week4.py', wdir='/home/arpagon/Workspace/DataAnalysisSpecialization/src/Data Analysis Tools') Dataset gapminder Obeservations: 213 Variables: 16 Variables chosen for the study are: internetuserate, polityscore Additional variables for the assignment: urbanrate, incomeperperson association between internetuserat and polityscore (0.36793963434864463, 2.8821795590767532e-06) Income Groups 1 = LOW INCOME, 2 = MEDIUM INCOME, 3=HIGT INCOME 1 45 2 76 3 32 dtype: int64 association between urbanrate and internetuserate for LOW income countries (0.06563803599695886, 0.66837059684015609) association between urbanrate and internetuserate for MIDDLE income countries (0.31325899553182263, 0.0058622755653961051) association between urbanrate and internetuserate for HIGH income countries (0.075925075001272269, 0.67960207113372362)

Scatter Plots:

Analysis:

Ok….. in the Correlation Analysis result as very weak relation in the Polity Score Whit Internet Use In Goups of countrys whit LOW and HIGH Income per person. In the MIDDLE Group has weak positive relation a coefficient of and 0.31 a p-value of 0.005. We conclude that the correlation is statistically significant.

R=0.31

Finally calculating the squared correlation coefficient (r2), we obtain the degree of variability of the variables that explain (or predict) the other.

R²=0,0961

In our case, the coefficient is equal to r2 = 0.1296. In other words, if we know the value of Politc Score, we can predict 9% of the variability we will see in Internet Use or we can not explain the other 91% of variability.

Can the Internet use democratizes society? Correlation Coefficient

3Question: Can the Internet use democratizes society? hypothesis: I am convinced, a deep democracy promotes the welfare of all people actively involved in the democratic process. Some years ago while reading the book Why Nations Fail. Suggested the deep relationship between poverty and lawlessness in the building of democratic institutions.

Code:

#!/usr/bin/env python # vim: set fileencoding=utf-8 : # -*- coding: utf-8 -*- # # week3.py # Copyright 2015 arpagon # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, # MA 02110-1301, USA. """ Perform analysis of variance for variables internetuserate, polityscore urbanrate, incomeperperson. On dataset of gapminder """ __version__ = "0.0.1" __license__ = """The GNU General Public License (GPL-2.0)""" __author__ = "Sebastian Rojo [email protected]" __contributors__ = [] _debug = 0 _question="Can the Internet use democratizes society?" _hypothesis=''' I am convinced, a deep democracy promotes the welfare of all people actively involved in the democratic process. Some years ago while reading the book Why Nations Fail. Suggested the deep relationship between poverty and lawlessness in the building of democratic institutions. ''' import pandas import numpy import seaborn as sns import scipy import matplotlib.pyplot as plt data = pandas.read_csv("../gapminder.csv", low_memory=False) print('Dataset gapminder \n') print('Obeservations:', len(data)) #number of observations (rows) print('Variables:', len(data.columns)) # number of variables (columns) print('\n\nVariables chosen for the study are: internetuserate, polityscore') print('Additional variables for the assignment: urbanrate, incomeperperson') #internetuserate data['internetuserate'] = data['internetuserate'].convert_objects( convert_numeric=True) data['polityscore'] = data['polityscore'].convert_objects( convert_numeric=True) data['urbanrate'] = data['urbanrate'].convert_objects( convert_numeric=True) data['internetuserate']=data['internetuserate'].replace(' ', numpy.nan) data['urbanrate']=data['urbanrate'].replace(' ', numpy.nan) data['polityscore']=data['polityscore'].replace(' ', numpy.nan) sub1=data[['country','internetuserate', 'polityscore','urbanrate']].dropna() data_variance = sub1.copy() #basic scatterplot: Q->Q plt.figure() scat1 = sns.regplot(x="internetuserate", y="polityscore", data=data_variance) plt.xlabel('Internet Use Rate') plt.ylabel('Overall polity score') plt.title('Scatterplot for the Association Between Internet Use Rate and' \ 'Overall polity score') plt.show() plt.close() data_clean=data_variance.dropna() print ('association between urbanrate and internetuserate') print (scipy.stats.pearsonr(data_clean['internetuserate'], data_clean['polityscore'])) #basic scatterplot: Q->Q plt.figure() scat1 = sns.regplot(x="urbanrate", y="polityscore", data=data_variance) plt.xlabel('Urban Rate') plt.ylabel('Overall polity score') plt.title('Scatterplot for the Association Between Urban Rate and' \ 'Overall polity score') plt.show() plt.close() data_clean=data_variance.dropna() print ('association between urbanrate and polityscore') print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['polityscore']))

Result:

In [6]: runfile('/home/arpagon/Workspace/DataAnalysisSpecialization/src/Data Analysis Tools/week3.py', wdir='/home/arpagon/Workspace/DataAnalysisSpecialization/src/Data Analysis Tools') Dataset gapminder Obeservations: 213 Variables: 16 Variables chosen for the study are: internetuserate, polityscore Additional variables for the assignment: urbanrate, incomeperperson association between urbanrate and internetuserate (0.36438422712027008, 3.1453595920263592e-06) association between urbanrate and polityscore (0.36438422712027008, 3.1453595920263592e-06) In [7]:

scatter plots:

Analysis:

Internet Use <-> Potity Score

The Pearson correlation chart indicates that the relationship between INCOMEPERPERSON and LIFEEXPECTANCY, has a coefficient of 0.60

We conclude that the correlation is statistically significant.

In our case, the coefficient is equal to r2 (0.60152 * 0.60152) = 0.3618. In other words, if we know the value of INCOMEPERPERSON, we can predict 36% of the variability we will see in LIFEEXPECTANCY, or we can not explain the other 64% of variability.

Ok..... in the Correlation Analysis result as weak positive relation in the Polity Score Whit Internet Use. has a coefficient of and 0.36 a p-value of 0.001. We conclude that the correlation is statistically significant.

R=0.36

Finally calculating the squared correlation coefficient (r2), we obtain the degree of variability of the variables that explain (or predict) the other.

R²=0,1296

In our case, the coefficient is equal to r2 = 0.1296. In other words, if we know the value of Politc Score, we can predict 12% of the variability we will see in Internet Use or we can not explain the other 87% of variability.

(Aditional Test ) Urban Rate <-> Potity Score

Correlation Analysis result as week positive relation in the Polity Score Whit Internet Use.

R=0.16

R²=0,0256

Can the Internet use democratizes society? Chi-Square Test of Independence

Code:

#!/usr/bin/env python # vim: set fileencoding=utf-8 : # -*- coding: utf-8 -*- # # week2.py # Copyright 2015 arpagon # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, # MA 02110-1301, USA. """ Perform analysis of variance for variables internetuserate, polityscore urbanrate, incomeperperson. On dataset of gapminder """ __version__ = "0.0.1" __license__ = """The GNU General Public License (GPL-2.0)""" __author__ = "Sebastian Rojo [email protected]" __contributors__ = [] _debug = 0 _question="Can the Internet use democratizes society?" _hypothesis=''' I am convinced, a deep democracy promotes the welfare of all people actively involved in the democratic process. Some years ago while reading the book Why Nations Fail. Suggested the deep relationship between poverty and lawlessness in the building of democratic institutions. ''' import pandas import numpy import seaborn import statsmodels.formula.api as smf import statsmodels.stats.multicomp as multi import matplotlib.pyplot as plt import scipy data = pandas.read_csv("../gapminder.csv", low_memory=False) print('Dataset gapminder \n') print('Obeservations:', len(data)) #number of observations (rows) print('Variables:', len(data.columns)) # number of variables (columns) print('\n\nVariables chosen for the study are: internetuserate, polityscore') print('Additional variables for the assignment: urbanrate, incomeperperson') #internetuserate data['internetuserate'] = data['internetuserate'].convert_objects( convert_numeric=True) data['polityscore'] = data['polityscore'].convert_objects( convert_numeric=True) sub1=data[['country','internetuserate', 'polityscore']].dropna() data_variance = sub1.copy() print('= categorizing explanatory variable intusegroup =') print('intusegroup4 - 4 categories - 0 low use, 10 is massive use') data_variance['intusegroup4']=pandas.cut(data_variance.internetuserate, [0, 25 , 50, 75, 100],labels=[25,50,75,100]) intusegroup4_count = data_variance['intusegroup4'].value_counts(sort=False) print (intusegroup4_count) print('\n==Percentages for internetuserate grouped as intusegroup4 ==') print('''Internet users (per 100 people) Internet users are people with access to the worldwide network.''') intusegroup4_percent = data_variance['intusegroup4'].value_counts(sort=False, normalize=True) print (intusegroup4_percent) print('= categorizing response variable polityscore =') print('polityscore2 - 2 polityscore') data_variance['polityscore2']=pandas.cut(data_variance.polityscore, [-10, 7 , 10], labels=[0,1]) polityscore2_count = data_variance['polityscore2'].value_counts(sort=False) print (polityscore2_count) print('\n==Percentages for Poltiyscore grouped as poltiyscore2 ==') print('''Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. The summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest.''') polityscore2_percent = data_variance['polityscore2'].value_counts(sort=False, normalize=True) print (polityscore2_percent) print('\n= contingency table of observed counts =') # contingency table of observed counts ct1=pandas.crosstab(data_variance['polityscore2'], data_variance['intusegroup4']) print (ct1) print('\n= column percentages =') # column percentages colsum=ct1.sum(axis=0) colpct=ct1/colsum print(colpct) print ('chi-square value, p value, expected counts') cs1= scipy.stats.chi2_contingency(ct1) print (cs1) # set variable types data_variance["intusegroup4"] = data_variance["intusegroup4"].astype('category') # new code for setting variables to numeric: data_variance['polityscore2'] = data_variance['polityscore2'].convert_objects( convert_numeric=True) seaborn.factorplot(x="intusegroup4", y="polityscore2", data=data_variance, kind="bar", ci=None) plt.ylabel('Politic Score 1:Democratic 0:inperfect democracy') plt.show() plt.close() print('\n= Comparative 25 and 50 =') recode2 = {25: 25, 50: 50} data_variance['COMP25v50']= data_variance['intusegroup4'].map(recode2) # contingency table of observed counts ct2=pandas.crosstab(data_variance['polityscore2'], data_variance['COMP25v50']) print (ct2) # column percentages colsum=ct2.sum(axis=0) colpct=ct2/colsum print(colpct) print ('chi-square value, p value, expected counts') cs2= scipy.stats.chi2_contingency(ct2) print (cs2) print('\n= Comparative 25 and 75 =') recode3 = {25: 25, 75: 75} data_variance['COMP25v75']= data_variance['intusegroup4'].map(recode3) # contingency table of observed counts ct3=pandas.crosstab(data_variance['polityscore2'], data_variance['COMP25v75']) print (ct3) # column percentages colsum=ct3.sum(axis=0) colpct=ct3/colsum print(colpct) print ('chi-square value, p value, expected counts') cs3= scipy.stats.chi2_contingency(ct3) print (cs3) print('\n= Comparative 25 and 100 =') recode4 = {25: 25, 100: 100} data_variance['COMP25v100']= data_variance['intusegroup4'].map(recode4) # contingency table of observed counts ct4=pandas.crosstab(data_variance['polityscore2'], data_variance['COMP25v100']) print (ct4) # column percentages colsum=ct4.sum(axis=0) colpct=ct4/colsum print(colpct) print ('chi-square value, p value, expected counts') cs4= scipy.stats.chi2_contingency(ct4) print (cs4) print('\n= Comparative 50 and 75=') recode5 = {50: 50, 75: 75} data_variance['COMP50v75']= data_variance['intusegroup4'].map(recode5) # contingency table of observed counts ct5=pandas.crosstab(data_variance['polityscore2'], data_variance['COMP50v75']) print (ct5) # column percentages colsum=ct5.sum(axis=0) colpct=ct5/colsum print(colpct) print ('chi-square value, p value, expected counts') cs5= scipy.stats.chi2_contingency(ct5) print (cs5) print('\n= Comparative 50 and 100 =') recode6 = {50: 50, 100: 100} data_variance['COMP50v100']= data_variance['intusegroup4'].map(recode6) # contingency table of observed counts ct6=pandas.crosstab(data_variance['polityscore2'], data_variance['COMP50v100']) print (ct6) # column percentages colsum=ct6.sum(axis=0) colpct=ct6/colsum print(colpct) print ('chi-square value, p value, expected counts') cs6= scipy.stats.chi2_contingency(ct6) print (cs6) print('\n= Comparative 75 and 100 =') recode7 = {75: 75, 100: 100} data_variance['COMP75v100']= data_variance['intusegroup4'].map(recode7) # contingency table of observed counts ct7=pandas.crosstab(data_variance['polityscore2'], data_variance['COMP75v100']) print (ct7) # column percentages colsum=ct7.sum(axis=0) colpct=ct7/colsum print(colpct) print ('chi-square value, p value, expected counts') cs7= scipy.stats.chi2_contingency(ct7) print (cs7)

Result:

runfile('/home/arpagon/Workspace/DataAnalysisSpecialization/src/Data Analysis Tools/week2.py', wdir='/home/arpagon/Workspace/DataAnalysisSpecialization/src/Data Analysis Tools') Dataset gapminder Obeservations: 213 Variables: 16 Variables chosen for the study are: internetuserate, polityscore Additional variables for the assignment: urbanrate, incomeperperson = categorizing explanatory variable intusegroup = intusegroup4 - 4 categories - 0 low use, 10 is massive use 25 74 50 41 75 23 100 17 dtype: int64 ==Percentages for internetuserate grouped as intusegroup4 == Internet users (per 100 people) Internet users are people with access to the worldwide network. 25 0.477419 50 0.264516 75 0.148387 100 0.109677 dtype: float64 = categorizing response variable polityscore = polityscore2 - 2 polityscore 0 87 1 66 dtype: int64 ==Percentages for Poltiyscore grouped as poltiyscore2 == Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. The summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest. 0 0.561290 1 0.425806 dtype: float64 = contingency table of observed counts = intusegroup4 25 50 75 100 polityscore2 0 60 22 4 1 1 14 18 19 15 = column percentages = intusegroup4 25 50 75 100 polityscore2 0 0.810811 0.55 0.173913 0.0625 1 0.189189 0.45 0.826087 0.9375 chi-square value, p value, expected counts (49.069261660631597, 1.2609118937138063e-10, 3, array([[ 42.07843137, 22.74509804, 13.07843137, 9.09803922], [ 31.92156863, 17.25490196, 9.92156863, 6.90196078]])) = Comparative 25 and 50 = COMP25v50 25 50 polityscore2 0 60 22 1 14 18 COMP25v50 25 50 polityscore2 0 0.810811 0.55 1 0.189189 0.45 chi-square value, p value, expected counts (7.5034569153757431, 0.0061580678540386433, 1, array([[ 53.22807018, 28.77192982], [ 20.77192982, 11.22807018]])) = Comparative 25 and 75 = COMP25v75 25 75 polityscore2 0 60 4 1 14 19 COMP25v75 25 75 polityscore2 0 0.810811 0.173913 1 0.189189 0.826087 chi-square value, p value, expected counts (28.934665837094506, 7.4861091265323711e-08, 1, array([[ 48.82474227, 15.17525773], [ 25.17525773, 7.82474227]])) = Comparative 25 and 100 = COMP25v100 25 100 polityscore2 0 60 1 1 14 15 COMP25v100 25 100 polityscore2 0 0.810811 0.0625 1 0.189189 0.9375 chi-square value, p value, expected counts (30.391698050509525, 3.5303583572118305e-08, 1, array([[ 50.15555556, 10.84444444], [ 23.84444444, 5.15555556]])) = Comparative 50 and 75= COMP50v75 50 75 polityscore2 0 22 4 1 18 19 COMP50v75 50 75 polityscore2 0 0.55 0.173913 1 0.45 0.826087 chi-square value, p value, expected counts (7.0407504180601999, 0.0079675641424967122, 1, array([[ 16.50793651, 9.49206349], [ 23.49206349, 13.50793651]])) = Comparative 50 and 100 = COMP50v100 50 100 polityscore2 0 22 1 1 18 15 COMP50v100 50 100 polityscore2 0 0.55 0.0625 1 0.45 0.9375 chi-square value, p value, expected counts (9.2982872200263511, 0.0022936819980624518, 1, array([[ 16.42857143, 6.57142857], [ 23.57142857, 9.42857143]])) = Comparative 75 and 100 = COMP75v100 75 100 polityscore2 0 4 1 1 19 15 COMP75v100 75 100 polityscore2 0 0.173913 0.0625 1 0.826087 0.9375 chi-square value, p value, expected counts (0.28816735933503818, 0.59139732995039396, 1, array([[ 2.94871795, 2.05128205], [ 20.05128205, 13.94871795]]))

Gaph:

Model Interpretation for Chi-Square Tests:

When examining the association between Internet Use Rate (categorical response.. Grouped in 4 groups.0-25, 25-50, 50-75, 75-100) and Polity Score (categorical explanatory... Grouped in 1: Democratic Country and 0:Inperfect democracy ) a chi-square test of independence revealed that Internet Use Rate, those with more Internet Use Rate have is more apximate to a real democracy X2=49.069261660631597, P=1.2609118937138063e-10.

Model Interpretation for post hoc Chi-Square Test results:

A Chi Square test of independence revealed that Internet User Rate, And Polity Score. (binary categorical variable) were significantly associated, X2=49.069261660631597, P=1.2609118937138063e-10.

Post hoc comparisons of rates of Internet User Rate by pairs of categories revealed that higher rates of Polity Score. were seen among those Internet User Rate up to 50 to 100. In comparison, prevalence of Polity Score. was statistically similar among those groups smoking

Can the Internet use democratizes society? Analysis of Variance

Code:

#!/usr/bin/env python # vim: set fileencoding=utf-8 : # -*- coding: utf-8 -*- # # week1.py # Copyright 2015 arpagon # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, # MA 02110-1301, USA. """ Perform analysis of variance for variables internetuserate, polityscore urbanrate, incomeperperson. On dataset of gapminder """ __version__ = "0.0.1" __license__ = """The GNU General Public License (GPL-2.0)""" __author__ = "Sebastian Rojo [email protected]" __contributors__ = [] _debug = 0 _question="Can the Internet use democratizes society?" _hypothesis=''' I am convinced, a deep democracy promotes the welfare of all people actively involved in the democratic process. Some years ago while reading the book Why Nations Fail. Suggested the deep relationship between poverty and lawlessness in the building of democratic institutions. ''' import pandas import numpy import statsmodels.formula.api as smf import statsmodels.stats.multicomp as multi data = pandas.read_csv("../gapminder.csv", low_memory=False) print('Dataset gapminder \n') print('Obeservations:', len(data)) #number of observations (rows) print('Variables:', len(data.columns)) # number of variables (columns) print('\n\nVariables chosen for the study are: internetuserate, polityscore') print('Additional variables for the assignment: urbanrate, incomeperperson') #internetuserate data['internetuserate'] = data['internetuserate'].convert_objects( convert_numeric=True) data['polityscore'] = data['polityscore'].convert_objects( convert_numeric=True) sub1=data[['country','internetuserate', 'polityscore']].dropna() data_variance = sub1.copy() print('intusegroup4 - 4 categories - 0 low use, 10 is massive use') data_variance['intusegroup4']=pandas.cut(data_variance.internetuserate, [0, 25 , 50, 75, 100]) intusegroup4_count = data_variance['intusegroup4'].value_counts(sort=False) print (intusegroup4_count) print('\n==Percentages for internetuserate grouped as intusegroup4 ==') print('''Internet users (per 100 people) Internet users are people with access to the worldwide network.''') intusegroup4_percent = data_variance['intusegroup4'].value_counts(sort=False, normalize=True) print (intusegroup4_percent) model = smf.ols(formula='polityscore ~ C(intusegroup4)', data=data_variance).fit() print (model.summary()) mc1 = multi.MultiComparison(data_variance['polityscore'], data_variance['intusegroup4']) res1 = mc1.tukeyhsd() print(res1.summary())

Result:

Dataset gapminder Obeservations: 213 Variables: 16 Variables chosen for the study are: internetuserate, polityscore Additional variables for the assignment: urbanrate, incomeperperson intusegroup4 - 4 categories - 0 low use, 10 is massive use (0, 25] 74 (25, 50] 41 (50, 75] 23 (75, 100] 17 dtype: int64 ==Percentages for internetuserate grouped as intusegroup4 == Internet users (per 100 people) Internet users are people with access to the worldwide network. (0, 25] 0.477419 (25, 50] 0.264516 (50, 75] 0.148387 (75, 100] 0.109677 dtype: float64 OLS Regression Results ============================================================================== Dep. Variable: polityscore R-squared: 0.133 Model: OLS Adj. R-squared: 0.116 Method: Least Squares F-statistic: 7.733 Date: Mon, 07 Dec 2015 Prob (F-statistic): 7.71e-05 Time: 00:46:23 Log-Likelihood: -492.18 No. Observations: 155 AIC: 992.4 Df Residuals: 151 BIC: 1005. Df Model: 3 Covariance Type: nonrobust ================================================================================================ coef std err t P>|t| [95.0% Conf. Int.] ------------------------------------------------------------------------------------------------ Intercept 1.9054 0.682 2.793 0.006 0.558 3.253 C(intusegroup4)[T.(25, 50]] 1.9239 1.142 1.684 0.094 -0.333 4.181 C(intusegroup4)[T.(50, 75]] 5.4424 1.401 3.885 0.000 2.675 8.210 C(intusegroup4)[T.(75, 100]] 5.6828 1.578 3.601 0.000 2.565 8.801 ============================================================================== Omnibus: 21.202 Durbin-Watson: 2.057 Prob(Omnibus): 0.000 Jarque-Bera (JB): 26.417 Skew: -1.010 Prob(JB): 1.84e-06 Kurtosis: 3.082 Cond. No. 4.02 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. Multiple Comparison of Means - Tukey HSD,FWER=0.05 ================================================= group1 group2 meandiff lower upper reject ------------------------------------------------- (0, 25] (25, 50] 1.9239 -1.0442 4.892 False (0, 25] (50, 75] 5.4424 1.8029 9.0819 True (0, 25] (75, 100] 5.6828 1.5825 9.7831 True (25, 50] (50, 75] 3.5186 -0.4531 7.4902 False (25, 50] (75, 100] 3.759 -0.6388 8.1567 False (50, 75] (75, 100] 0.2404 -4.6357 5.1166 False -------------------------------------------------

Model Interpretation for ANOVA:

When examining the association between current polityscore (quantitative response) and internetuserate categorize in 4 groups

Group Count

(0, 25] 74

(25, 50] 41

(50, 75] 23

(75, 100] 17

(categorical explanatory), an Analysis of Variance (ANOVA) revealed that among daily, young adult smokers (my sample), those with better internet Use significantly more polityscore.

Following ‘F’ can be found in the OLS table as the DF model and DF residuals. In this example 7.733 is the actual F value from the OLS table and we commonly report a very small p value as simply <.0001. (7.71e-05)

Model Interpretation for post hoc ANOVA results:

ANOVA revealed de intenetusereat and polityscore were significantly associated, F=7.733, p=7.71e-05 Post hoc comparisons of mean internetuserate groups by polityscore. All other comparisons were statistically similar.

Can the Internet use democratizes society? Analysis in graphs

Code:

#!/usr/bin/env python2.6 # vim: set fileencoding=utf-8 : # -*- coding: utf-8 -*- # # week4.py # Copyright 2015 arpagon # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, # MA 02110-1301, USA. """ Examine frequency distributions, for variables internetuserate, polityscore urbanrate, incomeperperson. On dataset of gapminder """ __version__ = "0.0.1" __license__ = """The GNU General Public License (GPL-2.0)""" __author__ = "Sebastian Rojo [email protected]" __contributors__ = [] _debug = 0 _question="Can the Internet use democratizes society?" _hypothesis=''' I am convinced, a deep democracy promotes the welfare of all people actively involved in the democratic process. Some years ago while reading the book Why Nations Fail. Suggested the deep relationship between poverty and lawlessness in the building of democratic institutions. ''' import pandas import numpy import seaborn as sns import matplotlib.pyplot as plt data = pandas.read_csv("gapminder.csv", low_memory=False) print('Dataset gapminder \n') print('Obeservations:', len(data)) #number of observations (rows) print('Variables:', len(data.columns)) # number of variables (columns) print('\n\nVariables chosen for the study are: internetuserate, polityscore') print('Additional variables for the assignment: urbanrate, incomeperperson') #internetuserate data['internetuserate'] = data['internetuserate'].convert_objects( convert_numeric=True) data['polityscore'] = data['polityscore'].convert_objects( convert_numeric=True) sub1=data[['country','internetuserate', 'polityscore', 'urbanrate', 'incomeperperson']] data_graph = sub1.copy() print(''' ############################################ ## Explanatory Variable = Internet Use ############################################ ''') # standard deviation and other descriptive statistics for internetuserate print('\n==Descrive internetuserate ==') print('Internet users (per 100 people) Internet users are people with access\n' 'to the worldwide network.') desc_internetuserate = data_graph['internetuserate'].describe() print(desc_internetuserate) #Univariate histogram for quantitative variable: plt.figure() sns.distplot(data_graph["internetuserate"].dropna(), kde=False); plt.xlabel('Internet users (per 100 people)') plt.title('Histogram of internetuser in the gapminder') plt.show() plt.close() print(''' ############################################ ## Response Variable = polityscore ############################################ ''') # standard deviation and other descriptive statistics for internetuserate print('\n==Descrive polityscore==') print('''Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. The summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest.''') desc_polityscore = data_graph['polityscore'].describe() print(desc_polityscore) #Univariate histogram for quantitative variable: plt.figure() sns.distplot(data_graph["polityscore"].dropna(), kde=False); plt.xlabel('Overall polity score') plt.title('Histogram of polityscore in the gapminder dataset') plt.show() plt.close() print(_question) print(_hypothesis) #basic scatterplot: Q->Q plt.figure() scat1 = sns.regplot(x="internetuserate", y="polityscore", data=data_graph) plt.xlabel('Internet Use Rate') plt.ylabel('Overall polity score') plt.title('Scatterplot for the Association Between Internet Use Rate and' \ 'Overall polity score') plt.show() plt.close()

Result:

runfile('/home/arpagon/Workspace/DataAnalysisSpecialization/src/week4.py', wdir='/home/arpagon/Workspace/DataAnalysisSpecialization/src') Dataset gapminder Obeservations: 213 Variables: 16 Variables chosen for the study are: internetuserate, polityscore Additional variables for the assignment: urbanrate, incomeperperson ############################################ ## Explanatory Variable = Internet Use ############################################ ==Descrive internetuserate == Internet users (per 100 people) Internet users are people with access to the worldwide network. count 192.000000 mean 35.632716 std 27.780285 min 0.210066 25% 9.999604 50% 31.810121 75% 56.416046 max 95.638113 Name: internetuserate, dtype: float64 ############################################ ## Response Variable = polityscore ############################################ ==Descrive polityscore== Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. The summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest. count 161.000000 mean 3.689441 std 6.314899 min -10.000000 25% -2.000000 50% 6.000000 75% 9.000000 max 10.000000 Name: polityscore, dtype: float64 Can the Internet use democratizes society? I am convinced, a deep democracy promotes the welfare of all people actively involved in the democratic process. Some years ago while reading the book Why Nations Fail. Suggested the deep relationship between poverty and lawlessness in the building of democratic institutions.

Analysis Results

Frequency Explanatory Variable = Internet Use

This graph is unimodal, with its highest peak at the start category of 0 to 20 internet user per 100 people. It seems to be skewed to the right as there are higher frequencies in lower categories than the higher categories.

Frequency Response Variable = polityscore

This graph is unimodal, with its highest peak at the qualify og 10. It seems to be skewed to the left as there are higher frequencies in the higher score.

association between your explanatory and response variables

The graph above plots the Overall polity score of a country to the country’s corresponding to Internet Use Rate. We can see that the scatter graph does not show a clear relationship/trend between the two variables.

Data Manage

Ok... for this leasson from data analisys use the data manage

Code:

#!/usr/bin/env python2.6 # vim: set fileencoding=utf-8 : # -*- coding: utf-8 -*- # # week2.py # Copyright 2015 arpagon # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, # MA 02110-1301, USA. """ Examine frequency distributions, for variables internetuserate, polityscore urbanrate, incomeperperson. On dataset of gapminder """ __version__ = "0.0.1" __license__ = """The GNU General Public License (GPL-2.0)""" __author__ = "Sebastian Rojo [email protected]" __contributors__ = [] _debug = 0 import pandas import numpy data = pandas.read_csv("gapminder.csv", low_memory=False) print('Dataset gapminder \n') print('Obeservations:', len(data)) #number of observations (rows) print('Variables:', len(data.columns)) # number of variables (columns) print('\n\nVariables chosen for the study are: internetuserate, polityscore') print('Additional variables for the assignment: urbanrate, incomeperperson') #internetuserate data['internetuserate'] = data['internetuserate'].convert_objects( convert_numeric=True) sub1=data[['country','internetuserate', 'polityscore', 'urbanrate', 'incomeperperson']] data_manage = sub1.copy() # quartile split (use qcut function & ask for 4 groups - gives you quartile split) print('intusegroup4 - 10 categories - 0 low use, 10 is massive use') data_manage['intusegroup4']=pandas.cut(data_manage.internetuserate, [0, 10 , 20 , 30, 40, 50, 60, 70 , 80, 90, 100]) print('\n==Counts for internetuserate - grouped as intusegroup4 ==') print('Internet users (per 100 people) Internet users are people with access\n' 'to the worldwide network.') intusegroup4_count = data_manage['intusegroup4'].value_counts(sort=False) print (intusegroup4_count) print('\n==Percentages for internetuserate grouped as intusegroup4 ==') print('''Internet users (per 100 people) Internet users are people with access to the worldwide network.''') intusegroup4_percent = data_manage['intusegroup4'].value_counts(sort=False, normalize=True) print (intusegroup4_percent) #polityscore data['polityscore'] = data['polityscore'].convert_objects( convert_numeric=True) print('\n==Counts for polityscore==') print('''Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. The summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest.''') polityscore_count = data['polityscore'].value_counts(sort=False) print (polityscore_count) print('\n==Percentages for polityscore==') print('''Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. The summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest.''') polityscore_percent = data['polityscore'].value_counts(sort=False, normalize=True) print (polityscore_percent) #urbanrate data_manage['urbanrate'] = data_manage['urbanrate'].convert_objects( convert_numeric=True) # quartile split (use qcut function & ask for 4 groups - gives you quartile split) print('urbanrategroup10 - 10 categories - 0 low urban rate, 10 is massive urban rate') data_manage['urbanrategroup10']=pandas.cut(data_manage.urbanrate, [0, 10 , 20 , 30, 40, 50, 60, 70 , 80, 90, 100]) print('\n==Counts for urbanrategroup10==') print('''Urban population refers to people living in urban areas as defined by national statistical offices (calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects) 10 categories - 0 low urban rate, 10 is massive urban rate''') urbanrategroup10_count = data_manage['urbanrategroup10'].value_counts(sort=False) print(urbanrategroup10_count) print('\n==Percentages for urbanrategroup10==') print('''Urban population refers to people living in urban areas as defined by national statistical offices (calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects)''') urbanrategroup10_percent = data_manage['urbanrategroup10'].value_counts(sort=False, normalize=True) print (urbanrategroup10_percent) #incomeperperson data_manage['incomeperperson'] = data_manage['incomeperperson'].convert_objects( convert_numeric=True) print('incomeperpersongroup10 - 10 categories by quantiles - 0 low incomeperperson, 10 is massive incomeperperson') data_manage['incomeperpersongroup10']=pandas.qcut(data_manage.urbanrate, 10, labels=[0, 1, 2 , 3 , 4, 5, 6, 7, 8 , 9]) print('\n==Counts for incomeperpersongroup10==') print('''Gross Domestic Product per capita in constant 2000 US$. The inflation but not the differences in the cost of living between countries has been taken into account.''') incomeperpersongroup10_count = data_manage['incomeperpersongroup10'].value_counts(sort=False) print(incomeperpersongroup10_count) print('\n==Percentages for incomeperperson==') print('''Gross Domestic Product per capita in constant 2000 US$. The inflation but not the differences in the cost of living between countries has been taken into account.''') incomeperpersongroup10_percent = data_manage['incomeperpersongroup10'].value_counts(sort=False, normalize=True) print (incomeperpersongroup10_percent)

Result:

runfile('/home/arpagon/Workspace/DataAnalysisSpecialization/src/week3.py', wdir='/home/arpagon/Workspace/DataAnalysisSpecialization/src') Dataset gapminder Obeservations: 213 Variables: 16 Variables chosen for the study are: internetuserate, polityscore Additional variables for the assignment: urbanrate, incomeperperson intusegroup4 - 10 categories - 0 low use, 10 is massive use ==Counts for internetuserate - grouped as intusegroup4 == Internet users (per 100 people) Internet users are people with access to the worldwide network. (0, 10] 49 (10, 20] 27 (20, 30] 17 (30, 40] 18 (40, 50] 25 (50, 60] 9 (60, 70] 14 (70, 80] 16 (80, 90] 12 (90, 100] 5 dtype: int64 ==Percentages for internetuserate grouped as intusegroup4 == Internet users (per 100 people) Internet users are people with access to the worldwide network. (0, 10] 0.230047 (10, 20] 0.126761 (20, 30] 0.079812 (30, 40] 0.084507 (40, 50] 0.117371 (50, 60] 0.042254 (60, 70] 0.065728 (70, 80] 0.075117 (80, 90] 0.056338 (90, 100] 0.023474 dtype: float64 ==Counts for polityscore== Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. The summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest. 0 6 1 3 2 3 3 2 4 4 5 7 6 10 7 13 8 19 9 15 10 33 -1 4 -10 2 -9 4 -8 2 -7 12 -6 3 -5 2 -4 6 -3 6 -2 5 dtype: int64 ==Percentages for polityscore== Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. The summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest. 0 0.028169 1 0.014085 2 0.014085 3 0.009390 4 0.018779 5 0.032864 6 0.046948 7 0.061033 8 0.089202 9 0.070423 10 0.154930 -1 0.018779 -10 0.009390 -9 0.018779 -8 0.009390 -7 0.056338 -6 0.014085 -5 0.009390 -4 0.028169 -3 0.028169 -2 0.023474 dtype: float64 urbanrategroup10 - 10 categories - 0 low urban rate, 10 is massive urban rate ==Counts for urbanrategroup10== Urban population refers to people living in urban areas as defined by national statistical offices (calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects) 10 categories - 0 low urban rate, 10 is massive urban rate (0, 10] 0 (10, 20] 13 (20, 30] 22 (30, 40] 24 (40, 50] 22 (50, 60] 24 (60, 70] 34 (70, 80] 24 (80, 90] 21 (90, 100] 19 dtype: int64 ==Percentages for urbanrategroup10== Urban population refers to people living in urban areas as defined by national statistical offices (calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects) (0, 10] 0.000000 (10, 20] 0.061033 (20, 30] 0.103286 (30, 40] 0.112676 (40, 50] 0.103286 (50, 60] 0.112676 (60, 70] 0.159624 (70, 80] 0.112676 (80, 90] 0.098592 (90, 100] 0.089202 dtype: float64 incomeperpersongroup10 - 10 categories by quantiles - 0 low incomeperperson, 10 is massive incomeperperson ==Counts for incomeperpersongroup10== Gross Domestic Product per capita in constant 2000 US$. The inflation but not the differences in the cost of living between countries has been taken into account. 0 21 1 20 2 20 3 20 4 21 5 20 6 20 7 20 8 20 9 21 dtype: int64 ==Percentages for incomeperperson== Gross Domestic Product per capita in constant 2000 US$. The inflation but not the differences in the cost of living between countries has been taken into account. 0 0.098592 1 0.093897 2 0.093897 3 0.093897 4 0.098592 5 0.093897 6 0.093897 7 0.093897 8 0.093897 9 0.098592 dtype: float64

Conclusion/analysis:

I collapsed the responses for urbanrate, internetuse. to show 10 categories represent the percentage of urban rate and internet use. Use 10 categories for incomeperperson dividen in qunatiles. and set NaN for Politicscore to create three new variables. Is incredible too see 23% of countries have internet use in the lowes rate. 0-10

Examine frequency distributions

Examine frequency distributions, for variables internetuserate, polityscore urbanrate, incomeperperson.

#!/usr/bin/env python2.6 # vim: set fileencoding=utf-8 : # -*- coding: utf-8 -*- # # week2.py # Copyright 2015 arpagon # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, # MA 02110-1301, USA. """ Examine frequency distributions, for variables internetuserate, polityscore urbanrate, incomeperperson. On dataset of gapminder """ __version__ = "0.0.1" __license__ = """The GNU General Public License (GPL-2.0)""" __author__ = "Sebastian Rojo [email protected]" __contributors__ = [] _debug = 0 import pandas import numpy data = pandas.read_csv("gapminder.csv", low_memory=False) print('Dataset gapminder \n') print('Obeservations:', len(data)) #number of observations (rows) print('Variables:', len(data.columns)) # number of variables (columns) print('\n\nVariables chosen for the study are: internetuserate, polityscore') print('Additional variables for the assignment: urbanrate, incomeperperson') #internetuserate data['internetuserate'] = data['internetuserate'].convert_objects( convert_numeric=True) print('\n==Counts for internetuserate==') print('Internet users (per 100 people) Internet users are people with access\n' 'to the worldwide network.') internetuserate_count = data['internetuserate'].value_counts(sort=False) print (internetuserate_count) print('\n==Percentages for internetuserate==') print('''Internet users (per 100 people) Internet users are people with access to the worldwide network.''') internetuserate_percent = data['internetuserate'].value_counts(sort=False, normalize=True) print (internetuserate_percent) #polityscore data['polityscore'] = data['polityscore'].convert_objects( convert_numeric=True) print('\n==Counts for polityscore==') print('''Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. The summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest.''') polityscore_count = data['polityscore'].value_counts(sort=False) print (polityscore_count) print('\n==Percentages for polityscore==') print('''Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. The summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest.''') polityscore_percent = data['polityscore'].value_counts(sort=False, normalize=True) print (polityscore_percent) #urbanrate data['urbanrate'] = data['urbanrate'].convert_objects( convert_numeric=True) print('\n==Counts for urbanrate==') print('''Urban population refers to people living in urban areas as defined by national statistical offices (calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects)''') urbanrate_count = data['urbanrate'].value_counts(sort=False) print(urbanrate_count) print('\n==Percentages for urbanrate==') print('''Urban population refers to people living in urban areas as defined by national statistical offices (calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects)''') urbanrate_percent = data['urbanrate'].value_counts(sort=False, normalize=True) print (urbanrate_percent) #incomeperperson data['incomeperperson'] = data['incomeperperson'].convert_objects( convert_numeric=True) print('\n==Counts for incomeperperson==') print('''Gross Domestic Product per capita in constant 2000 US$. The inflation but not the differences in the cost of living between countries has been taken into account.''') incomeperperson_count = data['incomeperperson'].value_counts(sort=False) print(incomeperperson_count) print('\n==Percentages for incomeperperson==') print('''Gross Domestic Product per capita in constant 2000 US$. The inflation but not the differences in the cost of living between countries has been taken into account.''') incomeperperson_percent = data['incomeperperson'].value_counts(sort=False, normalize=True) print (incomeperperson_percent)

runfile('/home/arpagon/Workspace/DataAnalysisSpecialization/src/week2.py', wdir='/home/arpagon/Workspace/DataAnalysisSpecialization/src') Dataset gapminder Obeservations: 213 Variables: 16 Variables chosen for the study are: internetuserate, polityscore Additional variables for the assignment: urbanrate, incomeperperson ==Counts for internetuserate== Internet users (per 100 people) Internet users are people with access to the worldwide network. 0.720009 1 1.400061 1 2.100213 1 3.654122 1 4.999875 1 5.098265 1 6.497924 1 7.232224 1 8.959140 1 9.999954 1 1.259934 1 11.090765 1 12.645733 1 13.598876 1 14.830736 1 ... 43.055067 1 61.987413 1 7.930096 1 26.477223 1 44.585355 1 2.199998 1 53.740217 1 29.879921 1 44.570074 1 40.020095 1 2.259976 1 6.965038 1 31.568098 1 20.663156 1 28.999477 1 Length: 192, dtype: int64 ==Percentages for internetuserate== Internet users (per 100 people) Internet users are people with access to the worldwide network. 0.720009 0.004695 1.400061 0.004695 2.100213 0.004695 3.654122 0.004695 4.999875 0.004695 5.098265 0.004695 6.497924 0.004695 7.232224 0.004695 8.959140 0.004695 9.999954 0.004695 1.259934 0.004695 11.090765 0.004695 12.645733 0.004695 13.598876 0.004695 14.830736 0.004695 ... 43.055067 0.004695 61.987413 0.004695 7.930096 0.004695 26.477223 0.004695 44.585355 0.004695 2.199998 0.004695 53.740217 0.004695 29.879921 0.004695 44.570074 0.004695 40.020095 0.004695 2.259976 0.004695 6.965038 0.004695 31.568098 0.004695 20.663156 0.004695 28.999477 0.004695 Length: 192, dtype: float64 ==Counts for polityscore== Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. The summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest. 0 6 1 3 2 3 3 2 4 4 5 7 6 10 7 13 8 19 9 15 10 33 -1 4 -10 2 -9 4 -8 2 -7 12 -6 3 -5 2 -4 6 -3 6 -2 5 dtype: int64 ==Percentages for polityscore== Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. The summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest. 0 0.028169 1 0.014085 2 0.014085 3 0.009390 4 0.018779 5 0.032864 6 0.046948 7 0.061033 8 0.089202 9 0.070423 10 0.154930 -1 0.018779 -10 0.009390 -9 0.018779 -8 0.009390 -7 0.056338 -6 0.014085 -5 0.009390 -4 0.028169 -3 0.028169 -2 0.023474 dtype: float64 ==Counts for urbanrate== Urban population refers to people living in urban areas as defined by national statistical offices (calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects) 84.54 1 15.10 1 36.82 1 30.88 1 93.32 1 74.92 1 29.54 1 10.40 1 71.40 1 73.64 1 13.22 1 14.32 1 77.20 1 51.64 1 17.00 1 ... 30.64 1 66.50 1 82.42 1 88.92 1 60.70 1 29.52 1 77.12 1 27.84 2 30.84 1 85.58 1 61.00 1 74.50 1 86.56 1 56.74 1 28.38 1 Length: 194, dtype: int64 ==Percentages for urbanrate== Urban population refers to people living in urban areas as defined by national statistical offices (calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects) 84.54 0.004695 15.10 0.004695 36.82 0.004695 30.88 0.004695 93.32 0.004695 74.92 0.004695 29.54 0.004695 10.40 0.004695 71.40 0.004695 73.64 0.004695 13.22 0.004695 14.32 0.004695 77.20 0.004695 51.64 0.004695 17.00 0.004695 ... 30.64 0.004695 66.50 0.004695 82.42 0.004695 88.92 0.004695 60.70 0.004695 29.52 0.004695 77.12 0.004695 27.84 0.009390 30.84 0.004695 85.58 0.004695 61.00 0.004695 74.50 0.004695 86.56 0.004695 56.74 0.004695 28.38 0.004695 Length: 194, dtype: float64 ==Counts for incomeperperson== Gross Domestic Product per capita in constant 2000 US$. The inflation but not the differences in the cost of living between countries has been taken into account. 2668.020519 1 5634.003948 1 6147.779610 1 772.933345 1 26551.844238 1 1543.956457 1 13577.879885 1 115.305996 1 523.950151 1 33923.313868 1 1860.753895 1 5900.616944 1 20751.893424 1 786.700098 1 275.884287 1 ... 722.807559 1 5188.900935 1 32292.482984 1 495.734247 1 10480.817203 1 5528.363114 1 242.677534 1 2534.000380 1 16372.499781 1 2549.558474 1 760.262365 1 31993.200694 1 22275.751661 1 2557.433638 1 25249.986061 1 Length: 190, dtype: int64 ==Percentages for incomeperperson== Gross Domestic Product per capita in constant 2000 US$. The inflation but not the differences in the cost of living between countries has been taken into account. 2668.020519 0.004695 5634.003948 0.004695 6147.779610 0.004695 772.933345 0.004695 26551.844238 0.004695 1543.956457 0.004695 13577.879885 0.004695 115.305996 0.004695 523.950151 0.004695 33923.313868 0.004695 1860.753895 0.004695 5900.616944 0.004695 20751.893424 0.004695 786.700098 0.004695 275.884287 0.004695 ... 722.807559 0.004695 5188.900935 0.004695 32292.482984 0.004695 495.734247 0.004695 10480.817203 0.004695 5528.363114 0.004695 242.677534 0.004695 2534.000380 0.004695 16372.499781 0.004695 2549.558474 0.004695 760.262365 0.004695 31993.200694 0.004695 22275.751661 0.004695 2557.433638 0.004695 25249.986061 0.004695 Length: 190, dtype: float64

#internet #internetuse #polityscore

Can the Internet use democratizes society?

I am convinced, a deep democracy promotes the welfare of all people actively involved in the democratic process. Some years ago while reading the book Why Nations Fail. Suggested the deep relationship between poverty and lawlessness in the building of democratic institutions.

Addition I am a hacktivist of the free Culture and I believe in the power of the Internet to make democratic changes on society.

To address this question, I will conduct research supported in the dataset of Gapminder.

The variables chosen are.

Internetuserate: 2010 Internet users (per 100 people) Internet users are people with access to the worldwide network.

polityscore: 2009 Democracy score (Polity) Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. The summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest.

Some source of similar issues:

Can the Internet Democratize Capitalism? Autor:Yanis Varoufakis link: http://www.internationalpolicydigest.org/2014/02/22/can-internet-democratize-capitalism/

The Internet’s Effect on Civil Society Development & Democratization Autor: James Warycha link: http://www.albany.edu/honorscollege/files/Warycha_thesis.docx

El uso de internet democratiza una sociedad?

Estoy convencido que una profunda democracia favorece el bienestar de todas las personas que intervienen activamente en el proceso democrático. Hace algunos años al leer el libro Por Qué Fracasan los Países. Sugería la profunda relación entre pobreza e iniquidad con la construcción de instituciones democráticas.

Para abordar esta pregunta realizare una investigación apoyado en los daos de GapMinder.

#internet

Trending Blogs

Recently Viewed Blogs

Arpagon Data Analysis