We used the following syntax to assess the relationship between income per person and urban rate in the GapMinder dataset.
The following results were obtained:
Interpretation of the results
The analysis shows a moderate positive correlation between income per person and urban rate (r = 0.49). This means that countries with higher average incomes tend to have a larger share of their population living in urban areas. The relationship is statistically significant (p < 0.001).
The scatter plot supports this result by showing an overall upward trend, alongside wide variation, especially among low-income countries. Many countries with similar income levels have very different urban rates, and a few high-income countries stand out as outliers. Overall, the plot suggests that income is moderately associated with urban rate.
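As a minimal sketch of what the coefficient measures, the snippet below computes Pearson's r directly from its definition. The function name pearson_r and the small data set are made up for illustration and are not the GapMinder values or the syntax used for the analysis.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences, from the definition."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up values for illustration only (not GapMinder data)
income = [500, 1200, 3000, 8000, 15000, 30000]
urban = [30, 25, 50, 45, 70, 80]
r = pearson_r(income, urban)
print(round(r, 2))
```

A value between 0 and 1, as here, corresponds to an upward trend with scatter around it, which is the pattern described for the plot.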
Per-Capita Income and Life Expectancy : W3 Data Analysis Tools
For the third week's assignment of the Data Analysis Tools course on Coursera, we continue to work with GapMinder's dataset, which contains statistics on social, economic, and environmental development at local, national, and global levels.
We study the relationship between a country's income per person and its life expectancy. Since both the explanatory variable (per-capita income) and the response variable (life expectancy) are quantitative, we calculate the Pearson correlation coefficient to assess the strength of the linear association between them.
The correlation analysis between the two variables gives:
The correlation coefficient is 0.60 with a very low p-value (<< 0.0001), which indicates a moderately strong and statistically significant relationship between per-capita income and life expectancy. The positive coefficient indicates that life expectancy tends to increase with a country's per-capita income.
However, looking at the scatter plot of the two variables, we see that a sharp increase in life expectancy occurs only at the very low end of the per-capita income spectrum. Beyond a per-capita income of 10000, life expectancy almost flattens out. So the correlation coefficient needs to be read together with the scatter plot: a single coefficient summarizes a relationship that is steep at low incomes and nearly flat at high incomes.
Pearson Correlation Test Results for life expectancy and HIV rate
A Pearson Correlation test revealed an association between life expectancy and HIV rates that is statistically significant, p-value approx. 1.55e-13, and moderately negative, coefficient approx. -0.56, which is clearly illustrated by the scatterplot graph above.
Thus, knowing the life expectancy helps predict approx. 32 percent of the variability seen in HIV rates (the squared coefficient is approx. 0.3177).
Pearson Correlation Test Results for life expectancy and breast cancer cases
A Pearson Correlation test revealed an association between life expectancy and breast cancer cases that is statistically significant, p-value approx. 1.54e-19, and moderately positive, coefficient approx. 0.66, which is clearly illustrated by the scatterplot graph above.
Thus, knowing the life expectancy helps predict approx. 44 percent of the variability seen in breast cancer cases (the squared coefficient is approx. 0.4364).
Week 3: Generating a Pearson Correlation Coefficient
Generating a correlation coefficient is a way to examine the association between a quantitative explanatory variable and a quantitative response variable.
Q->Q
For this course, I will perform a Pearson correlation and analyze the results.
A correlation can be assessed visually with a scatterplot, which shows the general form and direction of the relationship.
A Pearson correlation coefficient is generally useful only for scatterplots with a linear shape; for curved relationships it does not give meaningful results.
The coefficient “r” ranges from -1 to +1. A value close to +1 or -1 indicates a strong linear relationship between the variables (a perfect one at exactly +1 or -1), while a value near 0 indicates a very weak linear relationship.
For my test I will use the GapMinder dataset, as it already contains several quantitative variables recorded by country.
I will examine two correlations: first between breast cancer rate “breastcancerper100TH” and alcohol consumption rate “alcconsumption”, and second between breast cancer rate and the rate of individuals living in urban areas “urbanrate”.
To perform this in SAS, the following code is used:
The PROC CORR statement provides the correlations between the variables (in this case 3) in an output table.
For visualization I included scatterplots for both relations I am examining (breastcancerper100TH as the explanatory variable on the x-axis for both plots).
The output is the following:
For the relation between breast cancer and alcohol consumption we find r = 0.493 with a p-value of < 0.0001, indicating a moderate positive relation.
For the relation between breast cancer and urbanrate we find r = 0.57 with a p-value of < 0.0001, indicating a moderately strong positive relation.
When we square the r value, we get the coefficient of determination (R-square), which tells us what fraction of the variability in one variable can be predicted from the other.
Here r-square for breastcancer and alcconsumption is 0.243, so about 24.3% of the variability in the breast cancer rate can be predicted from the alcohol consumption rate.
The r-square for breastcancer and urbanrate is 0.325, so about 32.5% of the variability in the breast cancer rate can be predicted from the urban rate.
So we can say that the higher the urbanrate or the alcohol consumption rate, the higher the breast cancer rate tends to be.
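The step from r to r-square is plain arithmetic, and can be checked with a short sketch. The two r values are the ones reported in the text; the dictionary name r_values is just for illustration.

```python
# Correlation coefficients with breastcancerper100TH, as reported in the text
r_values = {
    "alcconsumption": 0.493,
    "urbanrate": 0.57,
}

for name, r in r_values.items():
    r_squared = r ** 2
    print(f"{name}: r = {r}, r^2 = {r_squared:.3f} ({r_squared:.1%} of variability)")
```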
For my research, I am planning to investigate the effect of alcohol consumption on the suicide rate. For my investigation, I will use the data gathered by the GapMinder foundation on the average suicide rate and the average alcohol consumption rate across 170 countries, gathered from the UN.
The average suicide rate per 100,000 people and the average alcohol consumption rate in liters per month are both quantitative variables. To identify the relationship between two quantitative variables, we will use Pearson's correlation, which shows how strong and how significant the relationship between them is.
The program I am using is Python, and the code is shown below:
# -*- coding: utf-8 -*-
"""
Created on Wed Aug 12 18:24:12 2020
"""
import pandas
import seaborn
import scipy.stats
import matplotlib.pyplot as plt
# Loading the data (file name assumed; the original listing omits this step)
data = pandas.read_csv('gapminder.csv', low_memory=False)
# Converting the variables of interest to numeric
for col in ['incomeperperson', 'alcconsumption', 'suicideper100th']:
    data[col] = pandas.to_numeric(data[col], errors='coerce')
# Plotting figures (one figure per scatterplot so they do not overlap)
plt.figure(1)
scat1 = seaborn.regplot(x="incomeperperson", y="alcconsumption", fit_reg=True, data=data)
plt.xlabel('Income per Person')
plt.ylabel('Alcohol usage')
plt.title('Scatterplot for the Association Between Income per Person and Alcohol usage')
plt.figure(2)
scat2 = seaborn.regplot(x="incomeperperson", y="suicideper100th", fit_reg=True, data=data)
plt.xlabel('Income per Person')
plt.ylabel('suicideper100th')
plt.title('Scatterplot for the Association Between Income per Person and Suicide Rate')
plt.figure(3)
scat3 = seaborn.regplot(x="alcconsumption", y="suicideper100th", fit_reg=True, data=data)
plt.xlabel('Alcohol usage')
plt.ylabel('suicideper100th')
plt.title('Scatterplot for the Association Between Alcohol usage and Suicide Rate')
plt.show()
# Cleaning data: drop rows with missing values (pearsonr cannot handle NaN)
data_clean = data.dropna()
# Applying Pearson correlation; each call returns (r, p-value)
print('association between Income per person and Alcohol usage')
print(scipy.stats.pearsonr(data_clean['incomeperperson'], data_clean['alcconsumption']))
print('association between income per person and suicide rate')
print(scipy.stats.pearsonr(data_clean['incomeperperson'], data_clean['suicideper100th']))
print('association between Alcohol usage and suicide rate')
print(scipy.stats.pearsonr(data_clean['alcconsumption'], data_clean['suicideper100th']))
Regarding average income and alcohol usage, there is a weak positive correlation, as shown in the figure below.
The relationship between income and suicide rate shows almost no correlation and is not significant, as shown in the figure below.
The relationship between alcohol usage and the suicide rate displays a weak positive correlation, as shown in the figure below.
The Pearson correlation coefficients and the p-values are displayed in the text below.
association between Income per person and Alcohol usage
(0.29177119807858265, 0.00010297020763990634)
Thus, there is a weak positive correlation between income per person and alcohol usage. However, it is statistically significant, as the p-value is below 0.05, so we can reject the null hypothesis and conclude that there is a significant relationship between alcohol usage and income.
association between income per person and suicide rate
(0.0060833598985392135, 0.9368723631391026)
Thus, there is essentially no correlation between income and the suicide rate (r is approximately 0.006), and it is not significant, as the p-value is approximately 0.94. We therefore fail to reject the null hypothesis: the data show no significant relationship between average income and the suicide rate.
association between Alcohol usage and suicide rate
(0.3874255193053243, 1.5168064802517918e-07)
Thus, there is a weak positive correlation between suicide rate and alcohol usage. However, it is statistically significant, as the p-value is below 0.05, so we can reject the null hypothesis and conclude that there is a significant relationship between alcohol usage and the suicide rate.
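The decision rule applied to the three results can be written out as a short sketch. The (r, p-value) pairs are copied from the program output above; the function name decide and the labels are hypothetical.

```python
ALPHA = 0.05  # significance level used in the text

# (r, p-value) pairs copied from the program output above
results = {
    "income vs alcohol": (0.29177119807858265, 0.00010297020763990634),
    "income vs suicide": (0.0060833598985392135, 0.9368723631391026),
    "alcohol vs suicide": (0.3874255193053243, 1.5168064802517918e-07),
}

def decide(p_value, alpha=ALPHA):
    """Apply the usual decision rule for a significance test."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

for label, (r, p) in results.items():
    print(f"{label}: r = {r:.2f}, p = {p:.2g} -> {decide(p)}")
```

Note that the rule only ever rejects or fails to reject the null hypothesis; a large p-value is not proof that no relationship exists.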
How are life expectancy and female employment rate associated with the overall employment rate of the country?
In this short study, I try to work out the relationship between life expectancy and the overall employment rate, and between the female employment rate and the overall employment rate. I used the Pearson correlation to support the data analysis.
Here I used the following explanatory and response variables for Case 1 and Case 2:
Case 1
Explanatory Variable: overall employment rate (Quantitative)
Response Variable: female employment rate (Quantitative)
Case 2
Explanatory Variable: overall employment rate (Quantitative)
Response Variable: life expectancy (Quantitative)
Note both the Explanatory and Response variables are Quantitative.
The null hypothesis in this case will be:
H_0 : There is no linear relation between the explanatory and response variables.
And the alternative hypothesis in this case will be:
H_a : There is a linear relation between the explanatory and response variables.
GapMinder is a non-profit venture promoting sustainable global development and achievement of the United Nations Millennium Development Goals. It seeks to increase the use and understanding of statistics about social, economic, and environmental development at local, national, and global levels. The GapMinder data includes one year of numerous country-level indicators of health, wealth and development.
According to Pearson Correlation on GapMinder Data, it was found that:
The p-value is lower than 0.05 in both cases, so the null hypothesis can be rejected in both cases.
Pearson ‘r’ takes values from -1 to 1. Negative values indicate a decreasing linear relationship and positive values an increasing one. Values closer to 0 indicate a weak linear relationship, i.e. the points in the graph will be more scattered, while values closer to 1 or -1 indicate a stronger linear relationship.
In other words, looking at the Pearson ‘r’ values and p-values of both cases, we can conclude that:
CASE1: There is a strong increasing linear relationship between the female employment rate and the overall employment rate, i.e. countries with higher overall employment rates tend to have higher female employment rates.
CASE2: There is a moderately weak decreasing linear relationship between life expectancy and the overall employment rate, i.e. countries with higher overall employment rates tend to have slightly lower life expectancy.
Note: Each dot in the above scatter plots represents one country.
Importance of the r^2 value:
The r^2 (r squared) value indicates the fraction of the variability of one variable that can be predicted from the other variable.
This means that:
For CASE1: r^2 = 0.736, i.e. if we know the overall employment rate of the country, we can predict 73.6% of the variability we will see in the rate of female employment.
For CASE2: r^2 = 0.106, i.e. if we know the overall employment rate of the country, we can predict only 10.6% of the variability we will see in life expectancy.
import numpy
import pandas
import seaborn
import scipy.stats
import matplotlib.pyplot as plt
data = pandas.read_csv('gapminder.csv', low_memory=False)
#setting variables you will be working with to numeric
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data['femaleemployrate'] = pandas.to_numeric(data['femaleemployrate'], errors='coerce')
data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')
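#first scatterplot: employrate vs femaleemployrate
f1 = plt.figure(1)
scat1 = seaborn.regplot(x="employrate", y="femaleemployrate", fit_reg=True, data=data)
plt.xlabel('employrate')
plt.ylabel('femaleemployrate')
plt.title('Scatterplot for the Association Between employrate and femaleemployrate')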
f2 = plt.figure(2)
scat2 = seaborn.regplot(x="employrate", y="lifeexpectancy", fit_reg=True, data=data)
plt.xlabel('employrate')
plt.ylabel('lifeexpectancy')
plt.title('Scatterplot for the Association Between employrate and lifeexpectancy')
plt.show()
data_clean=data.dropna()
print ('association between employrate and femaleemployrate')
C1 = scipy.stats.pearsonr(data_clean['employrate'], data_clean['femaleemployrate'])
print (C1)
i1, i2 = C1
i1 = (i1)**2
print ("Pearson r^2 for case 1 is : {}".format(i1))
print ('association between employrate and lifeexpectancy')
C2 = scipy.stats.pearsonr(data_clean['employrate'], data_clean['lifeexpectancy'])
print (C2)
j1, j2 = C2
j1 = (j1)**2
print ("Pearson r^2 for case 2 is : {}".format(j1))
Code output:
association between employrate and femaleemployrate
(0.8580995574610076, 3.007544561131105e-52)
Pearson r^2 for case 1 is : 0.7363348505147771
association between employrate and lifeexpectancy
(-0.32578397096782186, 1.0232954133795836e-05)
Pearson r^2 for case 2 is : 0.1061351957395626
Describing Bivariate data - Values of the Pearson Correlation
Learning Objectives
Describe what Pearson’s correlation measures
Give the symbols for Pearson’s correlation in the sample and in the population
State the possible range for Pearson’s correlation
Identify a perfect linear relationship
The Pearson product-moment correlation coefficient is a measure of the strength of the linear relationship between two variables. It is referred to as Pearson’s…