Discover Top Posts Tagged with #pearson correlation

Running a Pearson correlation in Python

We used the following syntax to assess the relationship between income per person and urban rate in the gapminder dataset.
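A minimal sketch of such an analysis, assuming the course's gapminder column names incomeperperson and urbanrate. The helper name pearson_for_columns is hypothetical, and the small table below is made up so the example runs without gapminder.csv:

```python
import pandas as pd
from scipy import stats


def pearson_for_columns(df, x, y):
    """Return (r, p, n) for two columns, ignoring rows where either is missing."""
    # Blank or non-numeric entries become NaN and are then dropped.
    pair = df[[x, y]].apply(pd.to_numeric, errors="coerce").dropna()
    r, p = stats.pearsonr(pair[x], pair[y])
    return r, p, len(pair)


# Made-up illustrative rows, NOT real gapminder values.
toy = pd.DataFrame({
    "incomeperperson": ["500", "1200", "3000", " ", "9000", "25000"],
    "urbanrate": ["20", "35", "45", "50", "", "85"],
})

r, p, n = pearson_for_columns(toy, "incomeperperson", "urbanrate")
print(f"r = {r:.2f}, p = {p:.3f}, n = {n}")
```

With the real data, the same call would be made on the frame returned by pandas.read_csv('gapminder.csv').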

The following results were obtained

Interpretation of the results

The analysis shows a moderate positive correlation between income per person and urban rate (r = 0.49). This means that countries with higher average incomes tend to have a larger share of their population living in urban areas. The relationship is statistically significant (p < 0.001).

The scatter plot supports this result by showing an overall upward trend, alongside wide variation, especially among low-income countries. Many countries with similar income levels have very different urban rates, and a few high-income countries stand out as outliers. Overall, the plot suggests that income is moderately associated with urban rate.

#coursera #wesleyanuniversity #Pearson Correlation

Per-Capita Income and Life Expectancy : W3 Data Analysis Tools

For the third week’s assignment of Data Analysis Tools on Coursera, we continue working with GapMinder's dataset, which contains statistics on social, economic, and environmental development variables at local, national, and global levels.

We will study the relationship between a country's income per person and its life expectancy. Since both the explanatory variable (Per-Capita Income) and the response variable (Life Expectancy) are quantitative, we'll calculate the Pearson Correlation Coefficient to analyze the strength of the correlation between the variables.

The Correlation Analysis between the two variables gives :

The Correlation Coefficient is 0.60 with a very low p-value << 0.0001, which indicates a considerably strong and significant relation between the Per-Capita Income and the Life Expectancy of individuals. A positive Correlation Coefficient indicates that the Life-Expectancy increases with the Per-Capita Income of a Country.

However, looking at the scatter-plot between the two variables, we see that a sharp increase in life expectancy occurs only at the very low end of the per-capita income spectrum. Beyond a per-capita income of 10000, life expectancy almost flattens out. So the strong correlation between the variables needs to be interpreted together with the scatter-plot, because the relationship is not linear over the whole income range.
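One way to see how such a flattening curve affects Pearson's r is to compare the coefficient on raw income with the coefficient on log income, alongside a rank-based measure. The sketch below uses synthetic data with an assumed logarithmic shape, not the gapminder values, so the numbers it prints say nothing about the real dataset:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data, NOT gapminder: life expectancy rises steeply at low income
# and flattens at high income (a logarithmic shape plus noise).
income = np.exp(rng.uniform(np.log(300), np.log(40000), size=200))
life_exp = 40 + 4 * np.log(income) + rng.normal(0, 2, size=200)

r_raw, _ = stats.pearsonr(income, life_exp)
r_log, _ = stats.pearsonr(np.log(income), life_exp)
rho, _ = stats.spearmanr(income, life_exp)

print(f"Pearson r, raw income: {r_raw:.2f}")
print(f"Pearson r, log income: {r_log:.2f}")
print(f"Spearman rho:          {rho:.2f}")
```

When the relationship is curved like this, the raw-income r understates how tightly the two variables move together, which is why the scatter-plot matters.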

#coursera #data analysis tools #pearson correlation #correlation coefficient

Generating a Correlation Coefficient Assignment

My Program:

Program Output and Summary:

Pearson Correlation Test Results for life expectancy and HIV rate

A Pearson Correlation test revealed an association between life expectancy and HIV rates that is statistically significant, p-value approx. 1.55e-13, and moderately negative, coefficient approx. -0.56, which is clearly illustrated by the scatterplot graph above.

Thus, knowing the life expectancy will help with predicting approx. 32 percent of the variability seen in HIV rates, since the coefficient squared is approx. 0.3177.

Pearson Correlation Test Results for life expectancy and breast cancer cases

A Pearson Correlation test revealed an association between life expectancy and breast cancer cases that is statistically significant, p-value approx. 1.54e-19, and moderately positive, coefficient approx. 0.66, which is clearly illustrated by the scatterplot graph above.

Thus, knowing the life expectancy will help with predicting approx. 44 percent of the variability seen in breast cancer cases, since the coefficient squared is approx. 0.4364.

#data analytics #datascience #Pearson Correlation

Week 3: Generating a Pearson Correlation Coefficient

A correlation coefficient is used to examine the dependence between a quantitative explanatory variable and a quantitative response variable.

Q->Q

For this course, I will perform a Pearson correlation and analyze the results.

A correlation can be assessed visually with a scatterplot, which shows the general form or shape of the relationship.

A Pearson correlation coefficient is generally useful only for scatterplots with a linear shape; for curved relationships it does not give meaningful results.

The coefficient “r” can range from -1 to +1. A value of exactly +1 or -1 indicates a perfect linear relation between the variables, whilst a value near 0 indicates a very weak linear connection between the variables examined.
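These properties can be checked on a few hand-made series. The sketch below computes r directly from its definition (the helper name pearson_r is hypothetical) for a perfect increasing line, a perfect decreasing line, and a symmetric curve:

```python
import numpy as np


def pearson_r(x, y):
    """Pearson r from its definition: co-variation over the product of spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


x = np.arange(-5, 6)  # -5, -4, ..., 5

r_up = pearson_r(x, 2 * x + 1)     # perfect increasing line
r_down = pearson_r(x, -3 * x + 4)  # perfect decreasing line
r_curve = pearson_r(x, x ** 2)     # exact relationship, but not linear

print(r_up, r_down, r_curve)
```

The two lines give coefficients at the ends of the range, while the parabola gives a coefficient of essentially zero even though y is completely determined by x.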

For my test I will use the gapminder dataset, as it already contains several quantitative variables recorded for different countries.

I will examine two correlations: first between breast cancer rate “breastcancerper100TH” and alcohol consumption rate “alcconsumption”, second between breast cancer rate and the rate of individuals living in urban areas “urbanrate”.

To perform that with SAS this code is used:

The Proc Corr statement provides the correlation between (in this case 3) variables in an output table.
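For readers working in Python rather than SAS, a rough analogue of that correlation table is pandas' DataFrame.corr. This is a sketch, not the SAS program used here: the rows below are made up, and unlike Proc Corr, DataFrame.corr reports only the coefficients, not the p-values:

```python
import pandas as pd

# Made-up illustrative rows, NOT real gapminder values.
toy = pd.DataFrame({
    "breastcancerper100TH": [10.0, 20.0, 30.0, 50.0, 80.0, 90.0],
    "alcconsumption": [1.0, 4.0, 3.0, 8.0, 9.0, 12.0],
    "urbanrate": [20.0, 30.0, 55.0, 50.0, 75.0, 90.0],
})

# Pairwise Pearson correlations, one row and one column per variable.
corr_matrix = toy.corr(method="pearson")
print(corr_matrix.round(3))
```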

For visualization I included scatterplots for both relations I am trying to examine (breastcancerper100TH as explanatory variable on the x-axis for both plots).

The output is the following:

For the relation between breast cancer and alcohol consumption we find r = 0.493 with a p-value of <0.0001, indicating a moderate positive relation.

For the relation between breast cancer and urbanrate we find r = 0.57 with a p-value of <0.0001, indicating a moderately strong positive relation between the variables.

When we square the r value, we get the coefficient of determination (r-square), which tells us what fraction of the variability in one variable can be predicted from the other.

Here r-square for breastcancer and alcconsumption is 0.243, so we could predict about 24.3% of the variability in breast cancer rate from the alcohol consumption rate.

The r-square for breastcancer and urbanrate is 0.325, so we could predict about 32.5% of the variability in breast cancer rate from the urban rate.
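The arithmetic behind these two figures is just the square of each reported coefficient. A short check, using only the r values quoted above:

```python
# r values as reported in the text above.
reported_r = {
    "breast cancer vs. alcohol consumption": 0.493,
    "breast cancer vs. urban rate": 0.57,
}

# Coefficient of determination: the square of r, rounded to three decimals.
r_squared = {pair: round(r ** 2, 3) for pair, r in reported_r.items()}

for pair, value in r_squared.items():
    print(f"{pair}: r^2 = {value} ({value:.1%} of variability)")
```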

So we can say that countries with a higher urban rate or a higher alcohol consumption rate tend to have a higher breast cancer rate.

#pearson correlation #correlation coefficient #coursera #data analysis #sas

The effect of alcohol consumption on suicide rate

For my research, I am planning to investigate the effect of alcohol consumption on the suicide rate. For my investigation, I will use the data gathered by the Gapminder foundation on the average suicide rate and the average alcohol consumption rate across 170 countries, gathered from the UN.

The average suicide rate per 100,000 people and the average alcohol consumption rate in liters per month are both quantitative variables. To identify the relationship between two quantitative variables, we will use Pearson’s correlation, and thus determine how significant and how strong the relationship between them is.

I am using Python for the coding, and the syntax for the code is shown below.

# -*- coding: utf-8 -*-
"""
Created on Wed Aug 12 18:24:12 2020

@author: omar.elfarouk
"""

import pandas
import seaborn
import scipy.stats  # importing scipy alone does not guarantee scipy.stats is loaded
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder.csv', low_memory=False)

# Setting the variables you will be working with to numeric;
# blank or non-numeric entries become NaN
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')

# Plotting figures (one new figure per scatterplot so they do not overlap)
plt.figure()
scat1 = seaborn.regplot(x="incomeperperson", y="alcconsumption", fit_reg=True, data=data)
plt.xlabel('incomeperperson')
plt.ylabel('Alcoholuse')
plt.title('Scatterplot for the Association Between income per person and Alcohol usage')

plt.figure()
scat2 = seaborn.regplot(x="incomeperperson", y="suicideper100th", fit_reg=True, data=data)
plt.xlabel('Income per Person')
plt.ylabel('suicideper100th')
plt.title('Scatterplot for the Association Between Income per Person and Suicide Rate')

plt.figure()
scat3 = seaborn.regplot(x="alcconsumption", y="suicideper100th", fit_reg=True, data=data)
plt.xlabel('Alcohol usage')
plt.ylabel('suicideper100th')
plt.title('Scatterplot for the Association Between Alcohol usage and Suicide Rate')

# Cleaning data: drops every row with a missing value in any column
data_clean = data.dropna()

# Applying Pearson correlation; each call returns (r, p-value)
print('association between Income per person and Alcohol usage')
print(scipy.stats.pearsonr(data_clean['incomeperperson'], data_clean['alcconsumption']))

print('association between income per person and suicide rate')
print(scipy.stats.pearsonr(data_clean['incomeperperson'], data_clean['suicideper100th']))

print('association between Alcohol usage and suicide rate')
print(scipy.stats.pearsonr(data_clean['alcconsumption'], data_clean['suicideper100th']))
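One detail of the program worth noting is that data.dropna() removes a country if any column in the whole table is missing, including columns not involved in a given correlation. A possible alternative, shown here only as a sketch on made-up rows, is to drop missing values per pair of variables with the subset argument:

```python
import numpy as np
import pandas as pd

# Made-up illustrative rows, NOT real gapminder values.
toy = pd.DataFrame({
    "incomeperperson": [500.0, 1200.0, np.nan, 9000.0, 25000.0],
    "alcconsumption": [2.0, 5.0, 7.0, 9.0, 11.0],
    "suicideper100th": [8.0, np.nan, 12.0, 10.0, 9.0],
    "internetuserate": [np.nan, 10.0, 30.0, 60.0, 80.0],  # unrelated column
})

listwise = toy.dropna()  # drops a row if ANY column is missing
pairwise = toy.dropna(subset=["alcconsumption", "suicideper100th"])

print(len(listwise), "rows after dropna() on the whole table")
print(len(pairwise), "rows after dropna() on the two columns of interest")
```

Which approach is appropriate depends on whether the three correlations should be computed on the same set of countries; the program above uses the same set for all three.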

The relationship between average income and alcohol usage shows a positive correlation, as shown in the figure below.

The relationship between income and suicide rate shows a weak correlation and an insignificant effect, as shown in the figure below.

The relationship between alcohol usage and suicide rate displays a positive correlation, as shown in the figure below.

The Pearson correlation coefficients and the p-values are displayed in the text below.

association between Income per person and Alcohol usage (0.29177119807858265, 0.00010297020763990634)

Thus, there is a weak positive correlation between income per person and alcohol usage. However, it is statistically significant, as the p-value is below 0.05, so we can reject the null hypothesis and conclude that there is a significant relation between alcohol usage and income.

association between income per person and suicide rate (0.0060833598985392135, 0.9368723631391026)

Thus, there is essentially no linear correlation between income and suicide rate (r is approximately 0.006), and the result is not significant, as the p-value is approximately 0.94. We therefore fail to reject the null hypothesis: the data show no significant relation between average income and suicide rate.

association between Alcohol usage and suicide rate (0.3874255193053243, 1.5168064802517918e-07)

Thus, there is a weak positive correlation between suicide rate and alcohol usage. However, it is statistically significant, as the p-value is below 0.05, so we can reject the null hypothesis and conclude that there is a significant relation between alcohol usage and suicide rate.
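The three decisions can be summarized with a small helper that compares each reported p-value with the 0.05 threshold. This is a sketch: the function name decide is hypothetical, and the (r, p) pairs are copied from the output above:

```python
ALPHA = 0.05  # significance threshold used in the post

# (r, p) pairs as reported in the output above.
reported = {
    "income vs. alcohol": (0.29177119807858265, 0.00010297020763990634),
    "income vs. suicide": (0.0060833598985392135, 0.9368723631391026),
    "alcohol vs. suicide": (0.3874255193053243, 1.5168064802517918e-07),
}


def decide(p_value, alpha=ALPHA):
    """Reject H0 (no linear association) only when p is below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


decisions = {name: decide(p) for name, (r, p) in reported.items()}
for name, (r, p) in reported.items():
    print(f"{name}: r = {r:.3f}, p = {p:.3g} -> {decisions[name]}")
```

Note that a large p-value means the null hypothesis is not rejected; it does not prove that the null hypothesis is true.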

#Pearson correlation #Data analysis #Coursera #Quantitative

How are life expectancy and female employment rate associated with the overall employment rate of the country?

In this short study, I try to work out the relation between life expectancy and the overall employment rate, and between the female employment rate and the overall employment rate. I used the Pearson correlation to support the data analysis.

I used the following explanatory and response variables for Case 1 and Case 2:

Case 1 Explanatory Variable: overall employment rate (Quantitative)

Response Variable: Female employment rate (Quantitative)

Case 2 Explanatory Variable: overall employment rate (Quantitative)

Response Variable: life expectancy (Quantitative)

Note that both the Explanatory and Response variables are Quantitative.

The null hypothesis in this case will be: H_0 : There is no linear relation between the explanatory and response variables.

And the alternate hypothesis in this case will be: H_a : There is a linear relation between the explanatory and response variables.
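The p-value used to test H_0 against H_a can be derived from r and the sample size n through the statistic t = r * sqrt((n - 2) / (1 - r^2)), which follows a t distribution with n - 2 degrees of freedom when H_0 is true. The sketch below (with a hypothetical helper pearson_p_value and synthetic data, not the gapminder values) compares this hand computation with the p-value that scipy reports:

```python
import numpy as np
from scipy import stats


def pearson_p_value(r, n):
    """Two-sided p-value for H0: no linear relation, via the t statistic."""
    t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))
    return float(2 * stats.t.sf(abs(t_stat), df=n - 2))


rng = np.random.default_rng(1)
x = rng.normal(size=50)            # synthetic data, not gapminder
y = 0.5 * x + rng.normal(size=50)

r, p_scipy = stats.pearsonr(x, y)
p_manual = pearson_p_value(r, len(x))
print(f"r = {r:.3f}, scipy p = {p_scipy:.3g}, manual p = {p_manual:.3g}")
```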

GapMinder is a non-profit venture promoting sustainable global development and achievement of the United Nations Millennium Development Goals. It seeks to increase the use and understanding of statistics about social, economic, and environmental development at local, national, and global levels. The GapMinder data includes one year of numerous country-level indicators of health, wealth and development

According to Pearson Correlation on GapMinder Data, it was found that:

The p-value is lower than 0.05 in both cases, which gives us sufficient evidence to reject the null hypothesis in both cases.

Pearson ‘r’ takes values from -1 to 1. Negative values indicate a decreasing linear relationship and positive values an increasing one. Values closer to 0 (zero) indicate a weak linear relationship, i.e. the points in the graph will be more scattered, while values closer to 1 or -1 indicate a stronger linear relationship. Looking at the Pearson ‘r’ values and p-values of both cases, we can conclude that:

CASE1: There is a strong increasing linear relationship between the female employment rate and the overall employment rate, i.e. countries with higher overall employment rates tend to have higher female employment rates.

CASE2: There is a moderately weak decreasing linear relationship between life expectancy and the overall employment rate, i.e. countries with higher overall employment rates tend to have slightly lower life expectancy.

Note: Each dot in the above scatter plots represents each country in the world.

Importance of the r^2 value:

The r^2 (r squared) value indicates the fraction of the variability of one variable that can be predicted by the other variable.

This means that, for CASE1, r^2 = 0.736, i.e. if we know the overall employment rate of a country, we can predict 73.6% of the variability we will see in the rate of female employment.

For CASE2, r^2 = 0.106, i.e. if we know the overall employment rate of a country, we can predict only 10.6% of the variability we will see in life expectancy.
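The reading of r^2 as a "fraction of variability predicted" can be made concrete: for a least-squares line, r^2 equals one minus the share of the response's variance left in the residuals. A sketch on synthetic data (the variable roles are only labels, not the gapminder values):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(30, 80, size=100)         # synthetic "employment rate"
y = 0.9 * x + rng.normal(0, 5, size=100)  # synthetic response

r = np.corrcoef(x, y)[0, 1]

# Fit the least-squares line and measure how much variability it leaves over.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
explained_fraction = 1 - residuals.var() / y.var()

print(f"r^2                = {r ** 2:.4f}")
print(f"explained fraction = {explained_fraction:.4f}")
```

The two printed numbers agree up to floating-point error, which is the sense in which r^2 measures predictable variability.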

----------------------------------------------------------------------------------------------------------

Data used:

#Data Analysis Tools #Data Analysis #Pearson Correlation #Saurabh3494

Describing Bivariate data - Values of the Pearson Correlation


Learning Objectives

Describe what Pearson’s correlation measures

Give the symbols for Pearson’s correlation in the sample and in the population

State the possible range for Pearson’s correlation

Identify a perfect linear relationship

The Pearson product-moment correlation coefficient is a measure of the strength of the linear relationship between two variables. It is referred to as Pearson’s…


#Pearson Correlation

Data Analysis Tools - Week 3 - Study Notes

The Pearson Correlation Coefficient (r) is a numerical measure of the linear relationship between two quantitative variables.

The value of r ranges from -1 to 1.

The correlation only measures the strength of a linear relationship between two variables.

Correlation ignores any other type of relationship no matter how strong.

r close to 0 indicates a weak linear relationship.

Post hoc tests are not necessary when conducting Pearson correlation.

Post hoc tests are needed only when a categorical explanatory variable has more than two levels.

r^2 = the fraction of the variability of one variable that can be predicted by the other.

#inferential statistics #pearson correlation #coursera