Logistic Regression - R

As I usually mention first, I want to compare Python and R analysis steps in the DataManViz, DataAnaT, and RegModPrac projects. Therefore, this is the R version of the Logistic Regression Python script I posted before. Here, I'll use logistic regression to test the association between internet use rate (my response variable, this time binned into two categories) and multiple explanatory variables, first and foremost new breast cancer cases. Again, the whole thing will look better over here. I had to switch back to using RMarkdown, since Jupyter had some problems I do not yet understand.

I will first run some of my previous code to remove variables I don't need and observations for which important data is missing.

# load data
gapminder <- read.table("../gapminder.csv", sep = ",", header = TRUE, quote = "\"")
# set row names
rownames(gapminder) <- gapminder$country
# subset data
sub_data <- subset(gapminder, select = c("breastcancerper100th", "urbanrate", "internetuserate", "incomeperperson"))
# remove rows with NAs
sub_data2 <- na.omit(sub_data)

Internet usage, my response variable, has to be in a binary format for logistic regression to work. In my Python script, I decided to use the 25% quartile (9.1) as cut-off, so I'll do the same here.

# bin response variable
sub_data2$internetBin <- as.numeric(sub_data2$internetuserate > 9.1)
summary(sub_data2)

For the binning, I simply test whether the internet usage in a country is above my threshold or not. as.numeric() then converts every returned TRUE to a 1 and every FALSE to a 0, and my conversion is done.

Let's go ahead with the logistic regression!

fit1 <- glm(internetBin ~ breastcancerper100th, data = sub_data2, family = "binomial")
summary(fit1)

Call:
glm(formula = internetBin ~ breastcancerper100th, family = "binomial", data = sub_data2)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.9427  -0.3112   0.2517   0.7259   1.7188

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)          -1.89496    0.57693  -3.285  0.00102 **
breastcancerper100th  0.10578    0.02285   4.630 3.66e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 183.87 on 162 degrees of freedom
Residual deviance: 135.78 on 161 degrees of freedom
AIC: 139.78

Number of Fisher Scoring iterations: 6

The model results look quite different from those returned by Python. Already familiar are the coefficients shown in the middle of the results, which say that there is a significant, positive association between internet usage and breast cancer. From the parameter estimate, the odds ratio can again be calculated, but for that it might be nice to have the confidence intervals as well. Luckily, these are relatively easy to obtain:

conf1 <- confint(fit1)

Waiting for profiling to be done...

conf1 <- cbind(conf1, OR = coef(fit1))
print("odds ratio with confidence intervals")
exp(conf1)

[1] "odds ratio with confidence intervals"
                          2.5 %   97.5 %        OR
(Intercept)          0.04444601 0.432423 0.1503249
breastcancerper100th 1.06804211 1.168457 1.1115793

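The odds ratios are the same as in Python. The confidence limits differ slightly, presumably because confint() profiles the likelihood here (hence the "Waiting for profiling to be done..." message) instead of simply adding and subtracting about two standard errors. Either way, the OR > 1 again signifies that the odds of internet usage being above my cut-off are higher in countries with higher breast cancer prevalence.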
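For comparison with the profile-likelihood limits from confint(), a simpler alternative is the Wald interval: estimate ± z × standard error, exponentiated. The following is a minimal Python sketch of that calculation, using the rounded estimate and standard error from summary(fit1). The variable names are made up for this illustration, and it is only an assumption that this is the kind of interval the Python model reported.

```python
import math
from statistics import NormalDist

# rounded values copied from summary(fit1)
estimate = 0.10578
std_error = 0.02285

# two-sided 95% interval: z is the 97.5% quantile of the standard normal
z = NormalDist().inv_cdf(0.975)

# exponentiate the estimate and both interval limits to get odds ratios
odds_ratio = math.exp(estimate)
wald_lower = math.exp(estimate - z * std_error)
wald_upper = math.exp(estimate + z * std_error)

print(round(odds_ratio, 4), round(wald_lower, 4), round(wald_upper, 4))
```

The Wald limits land close to 1.063 and 1.162, that is, near the interval in the Python output rather than the limits of 1.068 and 1.168 that confint() returned here.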

Below the coefficients, the R function lists deviance values. I am not really sure what they signify in the context of logistic regression, but apparently the goal of the model is to reduce them as much as possible. How successful the model is at that task can be analysed with an analysis of deviance table, using the Chi-squared test to calculate test statistics:

anova(fit1, test = "Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: internetBin

Terms added sequentially (first to last)

                     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
NULL                                   162     183.87
breastcancerper100th  1   48.088       161     135.78 4.076e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null model uses only the intercept and no explanatory variable, resulting in a high deviance. With breast cancer as explanatory variable, a significant reduction of deviance takes place. If the model contained more explanatory variables, the test would add them one at a time and determine their effect on the deviance. Time to add another explanatory variable!
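As a rough guide to what the deviance values mean: for a 0/1 response like internetBin, the deviance is simply -2 times the log-likelihood of the model, and the AIC adds 2 for every estimated coefficient. The Python sketch below checks this against the log-likelihoods that the Python version of this model reported (-67.890 for the model, -91.934 for the null model). The variable names are chosen for this illustration only.

```python
# log-likelihoods as reported in the Python logit summary for the same model
ll_model = -67.890
ll_null = -91.934
n_params = 2  # intercept + breast cancer coefficient

# for ungrouped binary data, deviance = -2 * log-likelihood
residual_deviance = -2 * ll_model
null_deviance = -2 * ll_null

# the drop in deviance is the test statistic in the deviance table
deviance_drop = null_deviance - residual_deviance

# AIC penalises the deviance by 2 per estimated parameter
aic = residual_deviance + 2 * n_params

print(round(null_deviance, 2), round(residual_deviance, 2))
print(round(deviance_drop, 3), round(aic, 2))
```

These numbers agree with the null deviance (183.87), residual deviance (135.78), deviance drop (48.088) and AIC (139.78) printed by R.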

fit2 <- glm(internetBin ~ breastcancerper100th + incomeperperson, data = sub_data2, family = "binomial")
summary(fit2)

Call:
glm(formula = internetBin ~ breastcancerper100th + incomeperperson, family = "binomial", data = sub_data2)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.50062  -0.31160   0.08903   0.58516   1.78481

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)          -2.0100509  0.6218483  -3.232  0.00123 **
breastcancerper100th  0.0761711  0.0256152   2.974  0.00294 **
incomeperperson       0.0004431  0.0001703   2.602  0.00927 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 183.87 on 162 degrees of freedom
Residual deviance: 120.18 on 160 degrees of freedom
AIC: 126.18

Number of Fisher Scoring iterations: 8

In Python, using both breast cancer and income as explanatory variables resulted in a warning about quasi complete separation of the two internet usage groups. This is not the case here, I wonder why...

Let's have a look at the odds ratios and deviance again...

conf2 <- confint(fit2)

Waiting for profiling to be done...

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
(the same warning is printed more than a dozen times in a row)

conf2 <- cbind(conf2, OR = coef(fit2))
print("odds ratio with confidence intervals")
exp(conf2)

[1] "odds ratio with confidence intervals"
                          2.5 %    97.5 %        OR
(Intercept)          0.03603527 0.4218608 0.1339819
breastcancerper100th 1.03010585 1.1398341 1.0791472
incomeperperson      1.00016777 1.0008310 1.0004432

Ah, here we go! This warning indicates that there is (quasi) complete separation happening, so I'll just move on to my third model, using breast cancer and urbanisation as explanatory variables.

fit3 <- glm(internetBin ~ breastcancerper100th + urbanrate, data = sub_data2, family = "binomial")
summary(fit3)

Call:
glm(formula = internetBin ~ breastcancerper100th + urbanrate, family = "binomial", data = sub_data2)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1845  -0.2828   0.2272   0.5919   1.9885

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)          -3.38831    0.72676  -4.662 3.13e-06 ***
breastcancerper100th  0.07813    0.02318   3.370 0.000752 ***
urbanrate             0.04759    0.01250   3.808 0.000140 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 183.87 on 162 degrees of freedom
Residual deviance: 118.51 on 160 degrees of freedom
AIC: 124.51

Number of Fisher Scoring iterations: 6

All right, what do the additional tests say this time?

conf3 <- confint(fit3)

Waiting for profiling to be done...

conf3 <- cbind(conf3, OR = coef(fit3))
print("odds ratio with confidence intervals")
exp(conf3)

[1] "odds ratio with confidence intervals"
                           2.5 %    97.5 %         OR
(Intercept)          0.007270543 0.1280973 0.03376562
breastcancerper100th 1.037775543 1.1373111 1.08126415
urbanrate            1.024544428 1.0763632 1.04873683

print("Chi-squared test of deviance")

[1] "Chi-squared test of deviance"

anova(fit3, test = "Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: internetBin

Terms added sequentially (first to last)

                     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
NULL                                   162     183.87
breastcancerper100th  1   48.088       161     135.78 4.076e-12 ***
urbanrate             1   17.267       160     118.51 3.247e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The odds ratio for an association between breast cancer and internet usage decreased a bit when urbanisation was added to the model -- as before. On the other hand, the residual deviance also decreases significantly upon addition of that second explanatory variable, so it definitely improves the model, while not being a confounder (since breast cancer is still significantly associated with internet usage).

We can also subtract each model's residual deviance from the null deviance, to have a simple view of the effect of the different explanatory variables. A high deviance is problematic, and since the null deviance (with the model using only the intercept) is usually higher than the deviance when explanatory variables are included, the difference shows just how much the model improved upon the addition of the variables.

print("difference in residual deviance using only breast cancer")
with(fit1, null.deviance - deviance)
print("difference in residual deviance using breast cancer and income")
with(fit2, null.deviance - deviance)
print("difference in residual deviance using breast cancer and urbanisation")
with(fit3, null.deviance - deviance)

[1] "difference in residual deviance using only breast cancer"
[1] 48.08763
[1] "difference in residual deviance using breast cancer and income"
[1] 63.68923
[1] "difference in residual deviance using breast cancer and urbanisation"
[1] 65.35506
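Each of these differences can be turned into a p-value by comparing it to a Chi-squared distribution whose degrees of freedom equal the number of explanatory variables in the model. The Python sketch below does this with closed-form tail probabilities that only hold for 1 and 2 degrees of freedom; chi2_sf is a helper written for this illustration, not a library function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a Chi-squared variable (df = 1 or 2 only)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("this sketch only covers df = 1 and df = 2")

# (null deviance - residual deviance, number of explanatory variables)
models = {
    "breast cancer": (48.08763, 1),
    "breast cancer + income": (63.68923, 2),
    "breast cancer + urbanisation": (65.35506, 2),
}

p_values = {name: chi2_sf(drop, df) for name, (drop, df) in models.items()}
for name, p in p_values.items():
    print(name, p)
```

The results should be close to the "LLR p-value" entries in the Python summaries of the same three models (4.076e-12, 1.479e-14 and 6.432e-15).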

The addition of other explanatory variables definitely improved my logistic model, but overall I think it's nowhere near the multiple linear regression which can directly work with my continuous data.

#RegModPrac #BCCIU #Logistic Regression #R

Logistic Regression - Python

I finally made it to week four of Regression Modelling in Practice! This is the last step in the regression analyses of my Breast Cancer Causes Internet Usage! (BCCIU) project, and once more I am forced to bin my quantitative response variable (I'm again only using internet usage) into two categories. This way, I'll be able to test a logistic regression, which works with binary (0/1) response variables. In the assignment, we're also tasked with checking for confounding, so I will use multiple logistic regression as well, with the same variables I used for my multiple linear regression analysis.

The output will look better on GitHub than on tumblr.

First up comes the code to prepare the raw data, filtering for

internet usage (internet users per 100 people in 2010), my response variable, as well as

breast cancer (new breast cancer cases per 100k females in 2002),

income (Gross Domestic Product per capita in 2010), and

urbanisation (urban population as percent of total population in 2008) as explanatory variables.

# activate inline plotting, should be first statement
%matplotlib inline
# load packages
import warnings # ignore warnings (e.g. from future, deprecation, etc.)
warnings.filterwarnings('ignore') # for layout reasons, after I read and acknowledged them all!
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
# read in data
data = pandas.read_csv("../gapminder.csv", low_memory=False)
# use country names as row names/indices
data.index = data["country"]
# drop() returns a new data frame, so the result has to be assigned
data = data.drop("country", axis=1)
# subset the data and make a copy to avoid error messages later on
sub = data[["breastcancerper100th", "incomeperperson", "internetuserate", "urbanrate"]]
sub_data = sub.copy()
# change data types to numeric
sub_data["breastcancerper100th"] = pandas.to_numeric(sub_data["breastcancerper100th"], errors="coerce")
sub_data["incomeperperson"] = pandas.to_numeric(sub_data["incomeperperson"], errors="coerce")
sub_data["internetuserate"] = pandas.to_numeric(sub_data["internetuserate"], errors="coerce")
sub_data["urbanrate"] = pandas.to_numeric(sub_data["urbanrate"], errors="coerce")
# remove rows with missing values (copy again)
sub2 = sub_data.dropna()
sub_data2 = sub2.copy()

As I stated before, my response variable needs to be in a "presence/absence" format this time, which will be coded as "1" for presence and "0" for absence of internet usage. But where should I set the cut-off?

# have a look at the data
print(sub_data2.describe())

       breastcancerper100th  incomeperperson  internetuserate   urbanrate
count            163.000000       163.000000       163.000000  163.000000
mean              37.781595      7312.376683        33.747359   56.245767
std               23.122332     10467.625388        27.868070   22.943194
min                3.900000       103.775857         0.720009   10.400000
25%               20.600000       691.093623         9.102256   36.840000
50%               30.300000      2425.471293        28.731883   59.460000
75%               50.350000      8880.432040        52.513403   73.490000
max              101.100000     52301.587179        95.638113  100.000000

I think I'll use the first quartile this time. That way, in countries with less than 9.1% internet usage, "internet usage will be absent".

# bin internet usage
sub_data2["internetBin"] = numpy.where(sub_data2["internetuserate"] > 9.1, 1, 0)
# examine data summary
print("data with binned response variable")
print(sub_data2.describe())

data with binned response variable
       breastcancerper100th  incomeperperson  internetuserate   urbanrate  \
count            163.000000       163.000000       163.000000  163.000000
mean              37.781595      7312.376683        33.747359   56.245767
std               23.122332     10467.625388        27.868070   22.943194
min                3.900000       103.775857         0.720009   10.400000
25%               20.600000       691.093623         9.102256   36.840000
50%               30.300000      2425.471293        28.731883   59.460000
75%               50.350000      8880.432040        52.513403   73.490000
max              101.100000     52301.587179        95.638113  100.000000

       internetBin
count   163.000000
mean      0.748466
std       0.435231
min       0.000000
25%       0.500000
50%       1.000000
75%       1.000000
max       1.000000

The where() function from numpy is really neat here. It returns either one value (1, in this case) or the other (0), depending on a given condition, which in this case is the test whether or not a value in the internetuserate column is higher or lower than the cut-off I defined. The data summary nicely shows that most of the internet usage values are now coded as 1. Note that these binary values are nevertheless numeric.
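Because the binned column only contains zeroes and ones, its mean in the describe() output is simply the share of countries coded as 1. A small Python sketch, using only the count and mean printed above (the variable names are chosen for this illustration), converts that share back into numbers of countries:

```python
# values copied from the describe() output for internetBin
n_countries = 163
share_of_ones = 0.748466

# the mean of a 0/1 column is the fraction of ones
n_ones = round(n_countries * share_of_ones)
n_zeros = n_countries - n_ones
print(n_ones, n_zeros)

# rebuilding a column with these counts gives back the same mean
internet_bin = [1] * n_ones + [0] * n_zeros
rebuilt_mean = sum(internet_bin) / len(internet_bin)
print(round(rebuilt_mean, 6))
```

So roughly three quarters of the countries end up in the "presence" group, which is what a cut-off at the first quartile is meant to achieve.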

Now I can run my logistic regression. For this, I need a different function than before, when we used ols(). This time, logit(), also from the statsmodels.formula.api package, will be used.

# logistic regression model for breast cancer and internet usage
print("logistic regression model for the association between breast cancer cases and internet use rate")
reg1 = smf.logit("internetBin ~ breastcancerper100th", data=sub_data2).fit()
print(reg1.summary())

logistic regression model for the association between breast cancer cases and internet use rate
Optimization terminated successfully.
         Current function value: 0.416506
         Iterations 8
                           Logit Regression Results
==============================================================================
Dep. Variable:            internetBin   No. Observations:                  163
Model:                          Logit   Df Residuals:                      161
Method:                           MLE   Df Model:                            1
Date:                Thu, 08 Dec 2016   Pseudo R-squ.:                  0.2615
Time:                        11:01:34   Log-Likelihood:                -67.890
converged:                       True   LL-Null:                       -91.934
                                        LLR p-value:                 4.076e-12
========================================================================================
                           coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------
Intercept               -1.8950      0.577     -3.284      0.001        -3.026    -0.764
breastcancerper100th     0.1058      0.023      4.629      0.000         0.061     0.151
========================================================================================

The logistic regression function returns a model not unlike that of a linear regression, including a (significant) p-value and a positive coefficient - indicating a positive correlation between internet usage and breast cancer. Since internet usage only has two outcomes now, though, generating a linear equation wouldn't make sense. Instead, we should look at probabilities, or - better yet - odds ratios (OR). The advantage of odds ratios over probabilities here is that an odds ratio is a constant number, while the probability of y being 0 or 1 changes with the value of x (which is still a quantitative variable). The odds ratio can be calculated directly from the coefficient returned by the regression model: it is the natural exponentiation of that coefficient (or parameter estimate). This can be easily calculated with numpy's exp() function:

print("odds ratio")
print(numpy.exp(reg1.params))

odds ratio
Intercept               0.150325
breastcancerper100th    1.111579
dtype: float64

The same function can also be used to include the confidence intervals returned by the model:

params = reg1.params
conf = reg1.conf_int()
conf["OR"] = params
conf.columns = ["Lower CI", "Upper CI", "OR"]
print("odds ratio with confidence intervals")
print(numpy.exp(conf))

odds ratio with confidence intervals
                      Lower CI  Upper CI        OR
Intercept             0.048522  0.465722  0.150325
breastcancerper100th  1.062896  1.162493  1.111579
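The statement that the odds ratio is constant while the probability changes with x can be checked with a few lines of plain Python. This is only a sketch: it uses the rounded coefficients from the summary above, and the function names are made up for this illustration.

```python
import math

# rounded coefficients from the logit summary above
intercept = -1.8950
slope = 0.1058

def probability(x):
    """Modelled probability that internetBin is 1 at breast cancer rate x."""
    return 1 / (1 + math.exp(-(intercept + slope * x)))

def odds(x):
    """Odds p / (1 - p) at breast cancer rate x."""
    p = probability(x)
    return p / (1 - p)

# the probability depends on x, the one-unit odds ratio does not
for x in (10, 30, 50, 70):
    one_unit_or = odds(x + 1) / odds(x)
    print(x, round(probability(x), 3), round(one_unit_or, 4))

# odds ratios multiply: ten extra cases per 100k correspond to exp(10 * slope)
ten_unit_or = math.exp(10 * slope)
print(round(ten_unit_or, 2))
```

The last line is a reminder that the odds ratio refers to a one-unit step in the explanatory variable; over a step of ten cases per 100k the odds are multiplied by almost three.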

The odds ratio of OR > 1 indicates that higher breast cancer prevalence goes along with higher odds of a "presence" of internet usage - something which I already described much better with linear regression. An odds ratio can take any value from zero to (positive) infinity, and a value of 1 means that the explanatory variable does not change the odds of the outcome at all. Since my OR = 1.1 refers to a single additional breast cancer case per 100k, it does not look very impressive at first sight. How will the addition of other explanatory variables change that?

# logistic regression model for breast cancer and income with internet usage
print("logistic regression model for the association between breast cancer cases and income with internet use rate")
reg2 = smf.logit("internetBin ~ breastcancerper100th + incomeperperson", data=sub_data2).fit()
print(reg2.summary())

logistic regression model for the association between breast cancer cases and income with internet use rate
Optimization terminated successfully.
         Current function value: 0.368648
         Iterations 10
                           Logit Regression Results
==============================================================================
Dep. Variable:            internetBin   No. Observations:                  163
Model:                          Logit   Df Residuals:                      160
Method:                           MLE   Df Model:                            2
Date:                Thu, 08 Dec 2016   Pseudo R-squ.:                  0.3464
Time:                        11:15:20   Log-Likelihood:                -60.090
converged:                       True   LL-Null:                       -91.934
                                        LLR p-value:                 1.479e-14
========================================================================================
                           coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------
Intercept               -2.0101      0.622     -3.232      0.001        -3.229    -0.791
breastcancerper100th     0.0762      0.026      2.974      0.003         0.026     0.126
incomeperperson          0.0004      0.000      2.602      0.009         0.000     0.001
========================================================================================

Possibly complete quasi-separation: A fraction 0.15 of observations can be perfectly predicted. This might indicate that there is complete quasi-separation. In this case some parameters will not be identified.

params2 = reg2.params
conf2 = reg2.conf_int()
conf2["OR"] = params2
conf2.columns = ["Lower CI", "Upper CI", "OR"]
print("odds ratio with confidence intervals")
print(numpy.exp(conf2))

odds ratio with confidence intervals
                      Lower CI  Upper CI        OR
Intercept             0.039603  0.453280  0.133982
breastcancerper100th  1.026306  1.134709  1.079147
incomeperperson       1.000109  1.000777  1.000443

The addition of income per person to the model had two consequences: first, the odds ratio for the association between breast cancer and internet usage is a little bit lower now, and second, a warning message is included in the model results. It warns about complete quasi-separation, or quasi-complete separation. This means that one of the predictors (which would have to be income, since the warning didn't occur before) almost completely separated the two internet usage categories. Apparently, there is a threshold in income per person below which almost all internet usage values are either 0 or 1, while almost all values above that threshold are the opposite. The problem with this is that the maximum likelihood estimates, which are used in logistic regression, cannot work with data in which the two distributions to compare don't - or barely - overlap, resulting in unreliable parameter estimates. In short: I shouldn't use income per person as explanatory variable here.
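What a separating threshold looks like can be illustrated with a toy example. The Python sketch below uses invented income values and labels (not the Gapminder data) and a hypothetical helper, separating_threshold, that reports a cut-off if one predictor splits the two groups perfectly. It only covers complete separation by a single variable, which is a simplification of the quasi-separation the warning refers to.

```python
def separating_threshold(x, y):
    """Return a cut-off that splits y == 0 from y == 1 perfectly, or None."""
    zeros = [xi for xi, yi in zip(x, y) if yi == 0]
    ones = [xi for xi, yi in zip(x, y) if yi == 1]
    if max(zeros) < min(ones):
        # all zeros lie below all ones
        return (max(zeros) + min(ones)) / 2
    if max(ones) < min(zeros):
        # all ones lie below all zeros
        return (max(ones) + min(zeros)) / 2
    return None  # the two groups overlap

# invented values, for illustration only
income = [300, 500, 800, 2000, 9000, 30000]
separated = [0, 0, 0, 1, 1, 1]    # every low-income case is 0
overlapping = [0, 1, 0, 1, 1, 1]  # one low-income case is 1

print(separating_threshold(income, separated))
print(separating_threshold(income, overlapping))
```

In the first case a cut-off between 800 and 2000 predicts every observation perfectly, which is exactly the situation in which maximum likelihood estimates become unreliable. In the second case the groups overlap and no such cut-off exists.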

I'll test if the urbanisation rate is confounding the relationship between internet usage and breast cancer, then. Internet usage could easily be associated with urbanisation, but probably won't be separated as well by it.

# logistic regression model for breast cancer and urbanisation with internet usage
print("logistic regression model for the association between breast cancer cases and urbanisation with internet use rate")
reg3 = smf.logit("internetBin ~ breastcancerper100th + urbanrate", data=sub_data2).fit()
print(reg3.summary())

logistic regression model for the association between breast cancer cases and urbanisation with internet use rate
Optimization terminated successfully.
         Current function value: 0.363538
         Iterations 8
                           Logit Regression Results
==============================================================================
Dep. Variable:            internetBin   No. Observations:                  163
Model:                          Logit   Df Residuals:                      160
Method:                           MLE   Df Model:                            2
Date:                Thu, 08 Dec 2016   Pseudo R-squ.:                  0.3554
Time:                        11:31:15   Log-Likelihood:                -59.257
converged:                       True   LL-Null:                       -91.934
                                        LLR p-value:                 6.432e-15
========================================================================================
                           coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------
Intercept               -3.3883      0.727     -4.662      0.000        -4.813    -1.964
breastcancerper100th     0.0781      0.023      3.370      0.001         0.033     0.124
urbanrate                0.0476      0.012      3.808      0.000         0.023     0.072
========================================================================================

params3 = reg3.params
conf3 = reg3.conf_int()
conf3['OR'] = params3
conf3.columns = ['Lower CI', 'Upper CI', 'OR']
print("odds ratio with confidence intervals")
print(numpy.exp(conf3))

odds ratio with confidence intervals
                      Lower CI  Upper CI        OR
Intercept             0.008125  0.140320  0.033766
breastcancerper100th  1.033227  1.131535  1.081264
urbanrate             1.023359  1.074744  1.048737

Similar to what I've seen in the multiple linear regression, the urbanisation rate is not truly confounding the association between internet usage and breast cancer. The p-values for both explanatory variables are still very low, and the odds ratios have narrow confidence intervals. Nevertheless, the odds ratio for breast cancer is even lower than before - closer to 1, the value at which breast cancer would make no difference to the odds of "presence" or "absence" of internet usage.

While logistic regression is an important and valuable tool to analyse categorical data, forcing my data into the right format for analysis did - not surprisingly - not lead to very convincing results.

#RegModPrac #BCCIU #logistic regression #python

Polynomial Regression - R

As always mentioned first, I want to compare Python and R analysis steps in the DataManViz, DataAnaT, and RegModPrac courses and the BCCIU (Breast Cancer Causes Internet Usage) project. Therefore, this is the R version of the Polynomial Regression Python script I posted before. While I used multiple linear regression to test the association between internet use rate (one of my response variables) and multiple explanatory variables (mainly breast cancer) in the first part of week three of the course, I will now apply polynomial regression to the association between female employment rate and breast cancer. I have already seen (for example when using basic linear regression) that there is no linear relationship between these two variables, but there does seem to be a curve that will now be explored further.

Again, the whole thing will look better over here.

I will first run some of my previous code to prepare R, and remove variables I don't need and observations for which important data is missing.

#setwd("C:/Users/nolah_000/Dropbox/coursera/Data Analysis and Interpretation/RegModPrac")
#setwd("C:/Users/Sarah/Dropbox/coursera/Data Analysis and Interpretation/RegModPrac")
setwd("C:/Users/spo12/Dropbox/coursera/Data Analysis and Interpretation/RegModPrac")
options(stringsAsFactors=FALSE)
# load libraries
library(repr) # for smaller plots
suppressMessages(library(ggplot2)) # for plotting
suppressMessages(library(gridExtra)) # for plotting ggplots side by side
library(car) # for diagnostics
# load data
gapminder <- read.table("../gapminder.csv", sep=",", header=TRUE, quote="\"")
# set row names
rownames(gapminder) <- gapminder$country
# subset data
sub_data <- subset(gapminder, select=c("femaleemployrate", "breastcancerper100th", "incomeperperson"))
# remove rows with NAs
sub_data2 <- na.omit(sub_data)

The explanatory variable breast cancer (and the second variable for use later, income per person) should be mean centred for easier interpretation. The scale() function in R is a bit more comfortable to use for this than the manual process we used in Python.

# centre breast cancer and income data
sub_data2$breastCentre <- scale(sub_data2$breastcancerper100th, scale=FALSE)
sub_data2$incomeCentre <- scale(sub_data2$incomeperperson, scale=FALSE)
summary(sub_data2)

 femaleemployrate breastcancerper100th incomeperperson    breastCentre.V1
 Min.   :12.40    Min.   :  3.90       Min.   :  103.8   Min.   :-33.94753
 1st Qu.:38.90    1st Qu.: 20.73       1st Qu.:  609.4   1st Qu.:-17.12253
 Median :48.20    Median : 30.15       Median : 2453.6   Median : -7.69753
 Mean   :47.88    Mean   : 37.85       Mean   : 7336.3   Mean   :  0.00000
 3rd Qu.:56.15    3rd Qu.: 50.38       3rd Qu.: 8993.4   3rd Qu.: 12.52747
 Max.   :83.30    Max.   :101.10       Max.   :52301.6   Max.   : 63.25247
  incomeCentre.V1
 Min.   :-7232.56
 1st Qu.:-6726.90
 Median :-4882.74
 Mean   :    0.00
 3rd Qu.: 1657.04
 Max.   :44965.25

R's summary() function has fewer problems with floats than Python's describe(), so the means of the centred variables are displayed as zeroes here. A bit annoying, though, is the ".V1" notation behind the column names I chose. This is a relic from the scale() function and I don't know how to avoid it. Interestingly, this addition is not shown when I simply print the column names of my data. If someone can explain that, please enlighten me.
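For reference, mean centring as done by scale(..., scale=FALSE) simply subtracts the column mean from every value. Here is a minimal plain-Python sketch with made-up numbers (not the Gapminder data); it also shows why the mean of a centred variable can come out as a tiny non-zero float instead of an exact zero.

```python
from statistics import mean

# hypothetical breast cancer rates, for illustration only
cases = [3.9, 20.7, 30.2, 50.4, 101.1]

# centring: subtract the mean from every observation
centre = mean(cases)
centred = [x - centre for x in cases]

print(centre)
print(centred)

# zero up to floating-point rounding, which some summaries print as e.g. 1e-15
centred_mean = mean(centred)
print(centred_mean)
```

After centring, a value of 0 stands for a country with exactly average breast cancer prevalence, which is what makes the intercept of the regression models easier to interpret.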

Let's start off once more with the basic linear model, to have something to compare the polynomial model to.

fit1 <- lm(femaleemployrate ~ breastCentre, data=sub_data2)
summary(fit1)

Call:
lm(formula = femaleemployrate ~ breastCentre, data = sub_data2)

Residuals:
    Min      1Q  Median      3Q     Max
-35.752  -9.360   0.640   8.711  34.604

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  47.87716    1.16039  41.259   <2e-16 ***
breastCentre -0.04464    0.05025  -0.888    0.376
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.77 on 160 degrees of freedom
Multiple R-squared: 0.004909, Adjusted R-squared: -0.001311
F-statistic: 0.7892 on 1 and 160 DF, p-value: 0.3757

As before, in the Python script as well as previous analyses, the linear model shows that there is no statistically significant (linear) association between the female employment rate of 2007 and new breast cancer cases registered in 2002. The coefficient of determination (R²) is very low, indicating that the breast cancer variable explains almost none of the variability in female employment. The coefficient for breast cancer is also very low, which would be fine, but it comes with a high p-value.

Now I can calculate a polynomial regression model by adding a squared breast cancer variable to my basic linear model. This works similarly to what I did in Python, since R also has an identity function, which is documented under the name AsIs. Its purpose (to inhibit the interpretation of an object, so that ^ is treated as arithmetic and not as a formula operator) and the function call (I()) are the same as in Python.

fit2 <- lm(femaleemployrate ~ breastCentre + I(breastCentre^2), data=sub_data2)
summary(fit2)

Call:
lm(formula = femaleemployrate ~ breastCentre + I(breastCentre^2), data = sub_data2)

Residuals:
    Min      1Q  Median      3Q     Max
-32.467  -8.516   0.767   8.389  32.558

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       42.949267   1.537349  27.937  < 2e-16 ***
breastCentre      -0.255227   0.066140  -3.859 0.000165 ***
I(breastCentre^2)  0.009240   0.002024   4.565 9.96e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.93 on 159 degrees of freedom
Multiple R-squared: 0.1202, Adjusted R-squared: 0.1091
F-statistic: 10.86 on 2 and 159 DF, p-value: 3.788e-05

Similar to Python, adding this polynomial term greatly improved the significance of the coefficients, and also increased the R². Nevertheless, the variability in breast cancer still explains only around 12% of the variability in female employment. This is not nearly as good as what I got with internet usage and breast cancer.
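The summaries also list an adjusted R², which penalises the plain R² for the number of explanatory terms in the model. The Python sketch below recomputes it from the printed R² values with the usual formula; adjusted_r2 is a helper written for this illustration, and the number of observations (162) is inferred from the residual degrees of freedom in the output.

```python
def adjusted_r2(r2, n_obs, n_terms):
    """Adjusted R-squared for n_obs observations and n_terms predictors
    (not counting the intercept)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_terms - 1)

# 160 residual degrees of freedom + 2 coefficients in the linear model
n_obs = 162

adj_linear = adjusted_r2(0.004909, n_obs, 1)    # breastCentre only
adj_polynomial = adjusted_r2(0.1202, n_obs, 2)  # plus the squared term

print(round(adj_linear, 6))
print(round(adj_polynomial, 4))
```

For the linear model the adjusted value is slightly negative, which is another way of saying that breast cancer alone explains essentially nothing here.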

It's time to visualise this association once more, and then consider what else can be done to improve the model.

# set plot sizes
options(repr.plot.width=8, repr.plot.height=4)
# scatterplot for breast cancer versus female employment
# with linear regression line
linear <- ggplot(sub_data2, aes(x=breastCentre, y=femaleemployrate)) +
  geom_point(colour="blue") +
  geom_smooth(method=lm) +
  xlab("centred breast cancer cases 2002") +
  ylab("female employ rate 2007") +
  ggtitle("Scatterplot for the Linear Association between\nBreast Cancer and Female Employment") +
  theme(plot.title=element_text(size=10),
        axis.text=element_text(size=8),
        axis.title.x=element_text(size=8),
        axis.title.y=element_text(size=8))
# scatterplot for breast cancer versus female employment
# with second order polynomial regression line
polynom <- ggplot(sub_data2, aes(x=breastCentre, y=femaleemployrate)) +
  geom_point(colour="blue") +
  geom_smooth(method=lm, formula=y ~ x + I(x^2)) +
  xlab("centred breast cancer cases 2002") +
  ylab("female employ rate 2007") +
  ggtitle("Scatterplot for the Polynomial Association between\nBreast Cancer and Female Employment") +
  theme(plot.title=element_text(size=10),
        axis.text=element_text(size=8),
        axis.title.x=element_text(size=8),
        axis.title.y=element_text(size=8))
grid.arrange(linear, polynom, ncol=2)

The graphs nicely show the convex curve in the data that makes the polynomial regression model a better fit than the linear one. Nevertheless, the countries with fewer breast cancer cases than the mean are still widely distributed and don't follow the curve as well as those with a higher number of breast cancer cases. This doesn't look like it can be resolved with the addition of another polynomial. In fact, I doubt that this can be resolved at all. The data set contains more countries with a breast cancer prevalence lower than the mean than countries with higher numbers (since this is the mean, and not the median, this is absolutely possible, and it means that there are few countries with very high numbers causing the high mean value). In the countries with fewer breast cancer cases, female employment rates can be anything from really low to really high.
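Where the fitted curve turns can be worked out from the coefficients of the second-order model: for y = a + b·x + c·x² with c > 0, the minimum lies at x = -b / (2c). The Python sketch below plugs in the rounded estimates from summary(fit2) and the mean breast cancer rate from the data summary (37.85) to translate the centred value back to the original scale. The variable names are chosen for this illustration only.

```python
# rounded coefficient estimates from summary(fit2)
a = 42.949267   # intercept
b = -0.255227   # breastCentre
c = 0.009240    # I(breastCentre^2)

# mean of breastcancerper100th, used for centring
breast_mean = 37.85

# vertex of the parabola y = a + b*x + c*x^2
x_min_centred = -b / (2 * c)
x_min_original = x_min_centred + breast_mean
y_min = a + b * x_min_centred + c * x_min_centred ** 2

print(round(x_min_centred, 2), round(x_min_original, 2), round(y_min, 2))
```

According to this, the fitted female employment rate is lowest at around 52 new breast cancer cases per 100k, somewhat above the mean, and rises again on both sides of that point.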

So, what have I learned so far about the relationship between breast cancer and female employment, based on the Gapminder data set?

Based on the visualisations I did of my three variables of interest, female employment was the only variable that displayed a normal (bell-shaped) distribution.

Also, countries with low breast cancer prevalence can apparently have any female employment rate from the spectrum.

When using breast cancer quartiles and looking at female employment rates inside those quartiles, there is a significant difference between the female employment rates in countries with low breast cancer prevalence and all other countries, according to ANOVA, though.

Most countries with high female employment rates have low breast cancer prevalence, and a Chi-squared test again revealed a statistically significant difference in female employment between countries with low breast cancer prevalence and all other countries.

Pearson correlation and basic linear regression both showed that there is no linear relationship between female employment and breast cancer: the linear model explains almost none of the variance in female employment.

When using income per person as moderator on the Pearson correlation analysis, a weak negative correlation between breast cancer and female employment can be detected in low income countries, while there is a weak positive correlation in high income countries.

I wonder whether it would help to take the income per person into account in the regression analysis as well.

fit3 <- lm(femaleemployrate ~ breastCentre + I(breastCentre^2) + incomeCentre, data=sub_data2)
summary(fit3)

Call:
lm(formula = femaleemployrate ~ breastCentre + I(breastCentre^2) + incomeCentre, data = sub_data2)

Residuals:
    Min      1Q  Median      3Q     Max
-32.521  -8.535   0.741   8.364  32.522

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        4.291e+01  1.583e+00  27.103  < 2e-16 ***
breastCentre      -2.511e-01  7.657e-02  -3.279  0.00128 **
I(breastCentre^2)  9.312e-03  2.139e-03   4.354 2.39e-05 ***
incomeCentre      -1.742e-05  1.618e-04  -0.108  0.91438
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.97 on 158 degrees of freedom
Multiple R-squared: 0.1203, Adjusted R-squared: 0.1036
F-statistic: 7.2 on 3 and 158 DF, p-value: 0.0001465

fit4 <- lm(femaleemployrate ~ breastCentre + incomeCentre, data=sub_data2)
summary(fit4)

Call:
lm(formula = femaleemployrate ~ breastCentre + incomeCentre, data = sub_data2)

Residuals:
    Min      1Q  Median      3Q     Max
-34.821  -9.503   0.904   8.207  34.838

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  47.8771604  1.1582857  41.335   <2e-16 ***
breastCentre -0.1120919  0.0734212  -1.527    0.129
incomeCentre  0.0002038  0.0001620   1.258    0.210
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.74 on 159 degrees of freedom
Multiple R-squared: 0.01472, Adjusted R-squared: 0.002322
F-statistic: 1.187 on 2 and 159 DF, p-value: 0.3077

Adding income per person as another variable into the regression models did not improve them, and there is no significant relationship between female employment and income.

The influence of income on the relationship between breast cancer and female employment was never very strong, and was only picked up by the Pearson correlation when using income quartiles to group the other data. It could clearly be seen in the regression lines based on these quartiles that countries with low income (lowest quartile) also showed low breast cancer prevalence with higher female employment rates (which decreased when breast cancer prevalence increased). In countries with high income (highest quartile), more breast cancer cases were detected while fewer women were employed, but both variables increased together. Since there seems to be a rather linear relationship between breast cancer and income per person, splitting the data by income would, of course, make the convex relationship between breast cancer and female employment clearer.

#RegModPrac #BCCIU #polynomial regression #R

Multiple Linear Regression - R

Programming without internet is no fun... But now I'm back and things are better.

It is time to program in R once more, after two Python posts! There is a change in layout, though: I installed an R kernel for Jupyter, so now I can create my R posts exactly like I create my Python posts. Exciting!

As mentioned before, I want to compare Python and R analysis steps in the DataManViz, DataAnaT, and RegModPrac projects. Therefore, this is the R version of the Multiple Linear Regression Python script I posted before. Here, I'll use multiple linear regression to test the association between internet use rate (my response variable) and multiple explanatory variables - but first and foremost new breast cancer cases.

Again, the whole thing will look better over here.

I will first run some of my previous code to prepare R, and remove variables I don't need and observations for which important data is missing.

#setwd("C:/Users/nolah_000/Dropbox/coursera/Data Analysis and Interpretation/RegModPrac")
setwd("C:/Users/Sarah/Dropbox/coursera/Data Analysis and Interpretation/RegModPrac")
#setwd("C:/Users/spo12/Dropbox/coursera/Data Analysis and Interpretation/RegModPrac")
options(stringsAsFactors=FALSE)

# load libraries
library(car) # for diagnostics
library(repr) # for smaller plots

# load data
gapminder <- read.table("../gapminder.csv", sep=",", header=TRUE, quote="\"")

# set row names
rownames(gapminder) <- gapminder$country

# subset data
sub_data <- subset(gapminder, select=c("breastcancerper100th", "urbanrate", "internetuserate", "incomeperperson"))

# remove rows with NAs
sub_data2 <- na.omit(sub_data)

The explanatory variables (all variables but internet usage) should be mean centred for easier interpretation. R's scale() function is a bit more convenient here than the manual process we used in Python.

# centre breast cancer data
sub_data2$breastCentre <- scale(sub_data2$breastcancerper100th, scale=FALSE)
sub_data2$incomeCentre <- scale(sub_data2$incomeperperson, scale=FALSE)
sub_data2$urbanCentre <- scale(sub_data2$urbanrate, scale=FALSE)
summary(sub_data2)

 breastcancerper100th   urbanrate       internetuserate  incomeperperson
 Min.   :  3.90        Min.   : 10.40   Min.   : 0.720   Min.   :  103.8
 1st Qu.: 20.60        1st Qu.: 36.84   1st Qu.: 9.102   1st Qu.:  691.1
 Median : 30.30        Median : 59.46   Median :28.732   Median : 2425.5
 Mean   : 37.78        Mean   : 56.25   Mean   :33.747   Mean   : 7312.4
 3rd Qu.: 50.35        3rd Qu.: 73.49   3rd Qu.:52.513   3rd Qu.: 8880.4
 Max.   :101.10        Max.   :100.00   Max.   :95.638   Max.   :52301.6

 breastCentre.V1     incomeCentre.V1     urbanCentre.V1
 Min.   :-33.8816    Min.   :-7208.60    Min.   :-45.84577
 1st Qu.:-17.1816    1st Qu.:-6621.28    1st Qu.:-19.40577
 Median : -7.4816    Median :-4886.91    Median :  3.21423
 Mean   :  0.0000    Mean   :    0.00    Mean   :  0.00000
 3rd Qu.: 12.5684    3rd Qu.: 1568.06    3rd Qu.: 17.24423
 Max.   : 63.3184    Max.   :44989.21    Max.   : 43.75423

R's summary() function has fewer problems with floats than Python's describe(): it rounds its output for display, so the means of the centred variables are shown as zeroes here. A bit annoying, though, are the ".V1" notations behind the column names I chose (I wonder why I didn't notice that last time). They are relics from the scale() function and I don't know how to avoid their creation. Interestingly, these additions are not shown when I simply print the column names of my data. If someone can explain that, please enlighten me.

print("column names of the data.frame:") colnames(sub_data2)

[1] "column names of the data.frame:"

'breastcancerper100th'

'urbanrate'

'internetuserate'

'incomeperperson'

'breastCentre'

'incomeCentre'

'urbanCentre'

Be that as it may, the topic here is multiple linear regression, so let's get started on that. Or, well, on linear regression, as I'd like to once more repeat the basic analysis before moving on to using multiple explanatory variables.

fit1 <- lm(internetuserate ~ breastCentre, data=sub_data2)
summary(fit1)

Call:
lm(formula = internetuserate ~ breastCentre, data = sub_data2)

Residuals:
    Min      1Q  Median      3Q     Max
-32.155 -11.719  -1.139   7.980  65.327

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  33.74736    1.34124   25.16   <2e-16 ***
breastCentre  0.95265    0.05819   16.37   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.12 on 161 degrees of freedom
Multiple R-squared: 0.6248, Adjusted R-squared: 0.6224
F-statistic: 268.1 on 1 and 161 DF, p-value: < 2.2e-16

While the layout is not as nice as that of the Python output, I still enjoy the simplicity of the output from lm(). The most important information is all there, with almost no other clutter: the coefficients show the significant positive association of my centred breast cancer variable with internet usage, and the r² value shows that around 62% of the variability in internet usage can be explained by the variability in breast cancer cases. Additionally, the high F-statistic (with its tiny p-value) shows that the model explains far more of the variance in internet usage than a model without any explanatory variable would, so the fit is unlikely to be due to chance.

Nevertheless, it would be interesting to see if the income of people, for example, can confound this association. Maybe people in higher income countries have more access to the internet, and to healthcare?

fit2 <- lm(internetuserate ~ breastCentre + incomeCentre, data=sub_data2)
summary(fit2)

Call:
lm(formula = internetuserate ~ breastCentre + incomeCentre, data = sub_data2)

Residuals:
    Min      1Q  Median      3Q     Max
-31.987 -10.710  -2.616   8.759  45.835

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.375e+01  1.122e+00  30.066  < 2e-16 ***
breastCentre 5.174e-01  7.129e-02   7.258 1.61e-11 ***
incomeCentre 1.316e-03  1.575e-04   8.359 2.89e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.33 on 160 degrees of freedom
Multiple R-squared: 0.7388, Adjusted R-squared: 0.7356
F-statistic: 226.3 on 2 and 160 DF, p-value: < 2.2e-16

Similar to Python, adding another explanatory variable to lm() is a simple addition of "+ variable" to the formula. The results are also similar to what I observed before: both centred variables are significantly and positively associated with internet usage, and the r² value of the model increased a bit. The F-statistic, on the other hand, is slightly decreased, but still high.

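Essentially, this indicates that both breast cancer prevalence and income are associated with internet usage, and that neither fully confounds the other (otherwise, one of the associations would no longer be significant). The breast cancer coefficient does drop from 0.95 to 0.52 once income is in the model, though, so income accounts for part of the original association.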

How about adding the urbanisation rates of the countries as well?

fit3 <- lm(internetuserate ~ breastCentre + incomeCentre + urbanCentre, data=sub_data2)
summary(fit3)

Call:
lm(formula = internetuserate ~ breastCentre + incomeCentre + urbanCentre, data = sub_data2)

Residuals:
    Min      1Q  Median      3Q     Max
-27.339  -9.309  -1.400   7.757  40.351

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.375e+01  1.064e+00  31.727  < 2e-16 ***
breastCentre 4.350e-01  7.013e-02   6.203 4.60e-09 ***
incomeCentre 1.115e-03  1.561e-04   7.145 3.07e-11 ***
urbanCentre  2.605e-01  5.950e-02   4.378 2.16e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.58 on 159 degrees of freedom
Multiple R-squared: 0.7669, Adjusted R-squared: 0.7625
F-statistic: 174.4 on 3 and 159 DF, p-value: < 2.2e-16

Again, not much of a hint of confounding. All three variables are positively associated with internet usage, without really diminishing the associations of the others. The r² also increased a bit more, while the F-statistic further decreased. That is to be expected: the F-statistic divides the explained variance by the number of explanatory variables, so it drops whenever a new variable adds only a little to r².
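The F-statistics of the three models can be reproduced from the reported r² values and degrees of freedom, which makes the decrease easier to follow. This is a small Python sketch of that arithmetic; the function name is my own, and the inputs are the rounded values from the summaries, so the results only match to the printed precision.

```python
def f_statistic(r_squared, n_predictors, residual_df):
    """F = (R^2 / k) / ((1 - R^2) / residual degrees of freedom)."""
    explained = r_squared / n_predictors
    unexplained = (1.0 - r_squared) / residual_df
    return explained / unexplained

# (R^2, number of predictors, residual df), copied from the lm() summaries
models = {
    "fit1": (0.6248, 1, 161),
    "fit2": (0.7388, 2, 160),
    "fit3": (0.7669, 3, 159),
}

f_values = {name: f_statistic(*args) for name, args in models.items()}
for name, f in f_values.items():
    print(f"{name}: F = {f:.1f}")
```

Each additional variable raises r² a little, but the explained variance is shared among more predictors, so the ratio shrinks from about 268 to about 174.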

The other part of this week's assignment on coursera was creating some diagnostic plots for the multiple regression model. We started out with a quantile-quantile plot (also called qq plot), which - in Python - plots the quantiles of a model's residuals against theoretical quantiles. The qqPlot() function from R's car package, on the other hand, uses studentised residuals (meaning each residual is divided by an estimate of its standard deviation) to plot against the theoretical quantiles. The message stays the same, though.
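What a qq plot actually puts on its two axes can be sketched in a few lines of Python. The residuals below are made up, and dividing by the sample standard deviation is a simplification of studentising (which would also account for each observation's leverage); the plotting positions (i - 0.5) / n are one common choice, not necessarily the one car uses.

```python
from statistics import NormalDist, mean, stdev

# Made-up residuals for illustration (NOT from the model above).
residuals = [-2.1, -1.3, -0.8, -0.4, -0.1, 0.2, 0.5, 0.9, 1.4, 3.0]

# scale the residuals and sort them: these are the sample quantiles
m, s = mean(residuals), stdev(residuals)
scaled = sorted((r - m) / s for r in residuals)

# theoretical standard normal quantiles at the positions (i - 0.5) / n
n = len(scaled)
theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# each pair is one point in the qq plot
for t, q in zip(theoretical, scaled):
    print(f"{t:6.2f}  {q:6.2f}")
```

If the residuals were normally distributed, the pairs would fall close to a straight line; points that drift away at the ends indicate heavier or lighter tails.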

options(repr.plot.width=5, repr.plot.height=5)
qqPlot(fit3, main="QQ Plot")

The plot only looks different to the Python version due to the different representation (especially the added confidence envelope in form of the dashed lines), but we can still see the line marking the perfect normal distribution and that the data points move away from that line at both ends. This indicates that the model I created doesn't fully capture the real relationship of internet usage and the explanatory variables I used and whatever I might have missed. I still think it's not too bad, though, and I never expected anything to be perfect.

Next, we plotted the standardised (value minus mean divided by standard deviation) residuals over their observation number. This will yield data with a mean of zero, where each +/- 1 out is one additional standard deviation from the mean. I'll improvise the observation number with a simple range from one to the number of rows in my data here, so that I can use the standard plotting function.

stdres <- rstandard(fit3)
options(repr.plot.width=5, repr.plot.height=5)
plot(1:nrow(sub_data2), stdres, ylab="Standardized Residual", xlab="Observation Number")
abline(0, 0)
abline(2, 0, lty=3)
abline(-2, 0, lty=3)

As before in Python, we can see that, while many standardised residuals stay within one standard deviation of the mean, some are between two and three. The number of observations with such values is high enough to suggest a poor model fit, or that an additional explanatory variable is still missing.
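To judge whether "some" residuals beyond two standard deviations are too many, it helps to know how many one would expect by chance. The Python sketch below assumes the standardised residuals behave like draws from a standard normal distribution, which only holds approximately and only if the model assumptions are met.

```python
import math

def two_sided_tail(z):
    """P(|Z| > z) for a standard normal variable Z."""
    return 1.0 - math.erf(z / math.sqrt(2.0))

n_obs = 163  # number of countries in the model above

expected_beyond_2 = n_obs * two_sided_tail(2.0)
expected_beyond_3 = n_obs * two_sided_tail(3.0)

print(f"expected beyond +/-2: {expected_beyond_2:.1f}")
print(f"expected beyond +/-3: {expected_beyond_3:.1f}")
```

Under that assumption, roughly seven of the 163 observations would be expected outside the dotted lines at +/-2, and fewer than one beyond +/-3; clearly more than that would point towards a poor fit.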

A third plot we generated to assess the quality of our multiple regression model was a leverage plot. This can show if outliers have an undue influence on the whole model.

influencePlot(fit3, xlab="Leverage")

             StudRes        Hat     CookD
Korea, Rep.  3.069187 0.04690272 0.1100620
Luxembourg  -1.615045 0.14922257 0.1132288

The influencePlot() function from the car package not only produces such a plot, but also returns more data on the most influential observations. In this case, these are the data from the Republic of Korea and from Luxembourg. While South Korea is clearly an outlier (studentised residual bigger than two), it doesn't have much leverage. Luxembourg, on the other hand, has one of the highest leverages (or hat values), but is not an outlier. By the way, Cook's distance is a measure of influence: it combines leverage and residual size, and indicates the effect of deleting a single observation from the data. I assume that the countries marked in the plot are those with the highest Cook's distances, since they also have the largest circle areas, which represent this.
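How the three columns relate to each other can be shown with a simple regression on made-up numbers. This Python sketch uses one explanatory variable and internally studentised (standardised) residuals to keep the formulas short; as far as I know, influencePlot() reports externally studentised residuals, so treat this as an illustration of the idea rather than a reimplementation.

```python
# Made-up data for illustration (NOT the Gapminder values).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]  # last point: extreme x
y = [1.1, 2.3, 2.9, 4.2, 9.0, 6.1, 6.8, 20.5]  # fifth point: far off the line

n = len(x)
p = 2  # estimated parameters: intercept and slope

# ordinary least squares for a single explanatory variable
x_mean = sum(x) / n
y_mean = sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)
slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
intercept = y_mean - slope * x_mean

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
mse = sum(e ** 2 for e in residuals) / (n - p)

# hat value (leverage): depends only on how far x is from its mean
hat = [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]

# standardised residual: residual divided by its estimated standard deviation
std_res = [e / (mse * (1.0 - h)) ** 0.5 for e, h in zip(residuals, hat)]

# Cook's distance: grows with both residual size and leverage
cooks = [r ** 2 / p * h / (1.0 - h) for r, h in zip(std_res, hat)]

for i in range(n):
    print(f"{i + 1}: res={std_res[i]:6.2f}  hat={hat[i]:.3f}  cook={cooks[i]:.3f}")
```

The fifth point plays the role of South Korea (large residual, little leverage), the last point that of Luxembourg (high leverage, small residual); both end up with the largest Cook's distances, each for a different reason.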

Similar plots can also be directly created by using plot.lm() - or simply plot() on the model:

par(mfrow=c(2,2))
plot(fit3)

The first figure plots the residuals versus fitted values and can be used to detect non-linear patterns in the residuals. In this case, the fitted line is not exactly linear, but it is difficult to assign it any other pattern instead.

The Q-Q plot is the same as I've shown before, suggesting a not-exactly normal distribution. Nevertheless, the deviation from the dashed line is not enough to worry me.

Plotting the scale-, or spread-, location is a good way to visualise the variance in the residuals. The residuals should be spread out equally along the range of fitted values, i.e. randomly spread points along a horizontal line would be nice. Obviously, I'm having a problem with heteroscedasticity (a lack of homoscedasticity) here. Maybe my model is really not as reliable as I'd hoped.

The fourth plot is again a leverage plot, showing essentially the same as before. Only the addition of the plotted Cook's distance is different, though we can't see much of these dashed red lines (except for in the upper right corner). Real cause for worry would be observations outside of these dashed lines, which we don't have here.

In general, I'm not sure how much I should worry about these diagnostics. Apparently I'm still not covering all information from the data in my model, but it seems to work all right. Interestingly, the same three countries are always highlighted as possibly problematic in the first three plots, and South Korea even appears in all four. What's the matter with these countries?

countries <- c("Korea, Rep.", "Slovak Republic", "Latvia", "United States", "Luxembourg")
sub_data2[countries, ]

                breastcancerper100th urbanrate internetuserate incomeperperson breastCentre incomeCentre urbanCentre
Korea, Rep.                     20.4     81.46        82.51593       16372.500   -17.381595     9060.123  25.2142331
Slovak Republic                 48.0     56.56        79.88978        8445.527    10.218405     1133.150   0.3142331
Latvia                          44.3     68.12        71.51472        5011.219     6.518405    -2301.157  11.8742331
United States                  101.1     81.70        74.24757       37491.180    63.318405    30178.803  25.4542331
Luxembourg                      82.5     82.44        90.07953       52301.587    44.718405    44989.210  26.1942331

In all countries that were mentioned in the diagnostics plots above, the internet use rate is high (it could also have values below one). The centred explanatory variables, on the other hand, differ greatly: South Korea sports negative centred breast cancer cases, while Latvia shows negative centred income and the centred urbanisation rate is lowest in the Slovak Republic. Luxembourg sticks out for having high values throughout.

What does that mean? Well, I don't know. Since, based on the linear model, we assume linear, positive relationships between internet usage and all explanatory variables, these five countries with high internet usage and very different explanatory variable values surely stick out. That is probably why they showed up on the plots. Are these big problems? I don't know, but since it's mostly one explanatory variable that breaks away from the expected pattern, it is probably not too bad.

#RegModPrac #BCCIU #multiple regression #diagnostics #R

Polynomial Regression - Python

Welcome back to week three of Regression Modelling in Practice! I'm writing this step in the Breast Cancer Causes Internet Usage! (BCCIU) project in two parts:

The first part applied a multiple regression model to analyse the association of one of my response variables (internet users per 100 people in 2010) with my primary explanatory variable (new breast cancer cases per 100,000 females in 2002) and additional variables (my previously used moderator income per person in 2010, and the percentage of urban populations in 2008). This was also the part of the project to be graded on coursera.

The second part (which is this) will make use of another modelling technique we learned about this week: polynomial regression. This will allow me to better analyse the association between my second response variable (percentage of employed females in 2007) with breast cancer as explanatory variable, since I've already seen that linear regression doesn't work on these data.

As before, the output will look lots better in the nbviewer than on tumblr.

First up comes the code to prepare the raw data, filtering for the country identifiers, breast cancer, and female employment rates.

# activate inline plotting, should be first statement
%matplotlib inline

# load packages
import warnings # ignore warnings (e.g. from future, deprecation, etc.)
warnings.filterwarnings('ignore') # for layout reasons, after I read and acknowledged them all!
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

# read in data
data = pandas.read_csv("../gapminder.csv", low_memory=False)

# subset the data and make a copy to avoid error messages later on
sub = data[["breastcancerper100th", "femaleemployrate"]]
sub_data = sub.copy()

# change data types to numeric
sub_data["breastcancerper100th"] = pandas.to_numeric(sub_data["breastcancerper100th"], errors="coerce")
sub_data["femaleemployrate"] = pandas.to_numeric(sub_data["femaleemployrate"], errors="coerce")

# remove rows with missing values (copy again)
sub2 = sub_data.dropna()
sub_data2 = sub2.copy()

As the explanatory variable, breast cancer prevalence should be mean centred for easier interpretation. To do this, I have to subtract the variable mean from every single observation.

# take breast cancer case column and subtract mean
sub_data2[["breastCentred"]] = sub_data2[["breastcancerper100th"]] - sub_data2[["breastcancerper100th"]].mean()

# examine data summary
print("data after centring")
print(sub_data2.describe())

data after centring
       breastcancerper100th  femaleemployrate  breastCentred
count            168.000000        168.000000   1.680000e+02
mean              37.550000         47.845238  -1.818651e-14
std               22.944904         14.696742   2.294490e+01
min                3.900000         12.400000  -3.365000e+01
25%               20.550000         39.100000  -1.700000e+01
50%               29.900000         48.200001  -7.650000e+00
75%               50.325000         56.050000   1.277500e+01
max              101.100000         83.300003   6.355000e+01

The describe() function again does not return a mean of exactly zero for the centred variable, but the value is very close to zero. This is a consequence of how floating-point numbers are represented, as explained in the Python tutorial.
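The effect is easy to reproduce without pandas. The values in this short Python sketch are made up; the point is only that sums of binary floating-point numbers carry tiny rounding errors, so a centred mean should be compared to zero with a tolerance rather than with ==.

```python
import math

# made-up values, centred the same way as the breast cancer column
values = [0.1, 0.2, 0.3, 0.4, 0.7]
values_mean = sum(values) / len(values)
centred = [v - values_mean for v in values]

centred_mean = sum(centred) / len(centred)
print(centred_mean)  # tiny, but not necessarily exactly 0.0

# compare with a tolerance instead of testing for equality
print(math.isclose(centred_mean, 0.0, abs_tol=1e-12))

# the classic example of the same effect
print(0.1 + 0.2 == 0.3)  # False
```

A value like -1.8e-14 in the describe() output is therefore zero for all practical purposes.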

I'll start again with the basic linear model so that I can compare this to the polynomial regression model.

# regression model for breast cancer and female employment
print("OLS regression model for the association between breast cancer cases and female employment rate")
reg1 = smf.ols("femaleemployrate ~ breastCentred", data=sub_data2).fit()
print(reg1.summary())

OLS regression model for the association between breast cancer cases and female employment rate
                            OLS Regression Results
==============================================================================
Dep. Variable:       femaleemployrate   R-squared:                       0.006
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.9456
Date:                Tue, 09 Aug 2016   Prob (F-statistic):              0.332
Time:                        17:14:25   Log-Likelihood:                -688.92
No. Observations:                 168   AIC:                             1382.
Df Residuals:                     166   BIC:                             1388.
Df Model:                           1
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
Intercept        47.8452      1.134     42.189      0.000        45.606    50.084
breastCentred    -0.0482      0.050     -0.972      0.332        -0.146     0.050
==============================================================================
Omnibus:                        0.035   Durbin-Watson:                   1.868
Prob(Omnibus):                  0.983   Jarque-Bera (JB):                0.070
Skew:                          -0.033   Prob(JB):                        0.966
Kurtosis:                       2.926   Cond. No.                         22.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

For the association of female employment rates and breast cancer cases per 100,000 females, the r² value (the "coefficient of determination") at the top right of the OLS regression results is 0.006. This means that the variability in occurrence of new breast cancer cases can explain only 0.6% of the variability in the female employment rates based on linear regression.

Below this first part of the results, parameter estimates are presented for both the intercept (the expected female employment rate at the mean breast cancer rate, since the explanatory variable is centred) and breast cancer cases. The coefficient for breast cancer is negative, meaning that there might be a negative association between breast cancer and female employment. The high p-value and the confidence interval encompassing zero indicate that this result is not statistically significant, though, and we shouldn't speak of an association here.

This can also be demonstrated in a scatterplot with the associated regression line.

# plot bivariate scatterplot
seaborn.regplot(x="breastcancerper100th", y="femaleemployrate", fit_reg=True, data=sub_data2);
plt.xlabel('breast cancer cases 2002');
plt.ylabel('female employ rates 2007');
plt.title('Scatterplot for the Association between Breast Cancer and Female Employment');

Since the data points show a slightly curved distribution, it would make sense to use a second order polynomial regression instead. It looks like this:

# fit second order polynomial
seaborn.regplot(x="breastcancerper100th", y="femaleemployrate", scatter=True, order=2, data=sub_data2)
plt.xlabel('breast cancer cases 2002');
plt.ylabel('female employ rates 2007');
plt.title('Scatterplot for the Association between Breast Cancer and Female Employment');

This looks better, doesn't it? I can run the regression analysis for this by including a new term in the formula: I(breastCentred**2). The I() here is the so-called identity function from patsy, the formula package that statsmodels uses (the package is named after a Monty Python movie character). This function returns its input unchanged, so whatever is inside it is evaluated as ordinary Python arithmetic, enabling me to add a term for my squared explanatory variable. Without this function, patsy would read the **2 as a formula operator acting on the model terms, instead of as a transformation only for that one variable.

# polynomial regression model for breast cancer and female employment
print("OLS polynomial regression model for the association between breast cancer cases and female employment rate")
reg2 = smf.ols("femaleemployrate ~ breastCentred + I(breastCentred**2)", data=sub_data2).fit()
print(reg2.summary())

OLS polynomial regression model for the association between breast cancer cases and female employment rate
                            OLS Regression Results
==============================================================================
Dep. Variable:       femaleemployrate   R-squared:                       0.124
Model:                            OLS   Adj. R-squared:                  0.113
Method:                 Least Squares   F-statistic:                     11.64
Date:                Tue, 09 Aug 2016   Prob (F-statistic):           1.87e-05
Time:                        17:16:22   Log-Likelihood:                -678.32
No. Observations:                 168   AIC:                             1363.
Df Residuals:                     165   BIC:                             1372.
Df Model:                           2
Covariance Type:            nonrobust
=========================================================================================
                            coef    std err          t      P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------------
Intercept                42.9165      1.495     28.710      0.000        39.965    45.868
breastCentred            -0.2665      0.066     -4.052      0.000        -0.396    -0.137
I(breastCentred ** 2)     0.0094      0.002      4.712      0.000         0.005     0.013
==============================================================================
Omnibus:                        1.084   Durbin-Watson:                   1.675
Prob(Omnibus):                  0.581   Jarque-Bera (JB):                1.061
Skew:                          -0.191   Prob(JB):                        0.588
Kurtosis:                       2.921   Cond. No.                     1.28e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.28e+03. This might indicate that there are strong multicollinearity or other numerical problems.

The model is still far from perfect, but definitely better than before. The r² value is now 0.124, which means that the variability in the explanatory variable (breast cancer cases) can explain 12% of the variability of the response variable (female employment). Additionally, the negative coefficient for the explanatory variable is even lower than before, and now comes with a low p-value and an all-negative 95% confidence interval as well, so this is a statistically significant negative association. The coefficient for the quadratic term, on the other hand, is positive, and this association is also significant. This means that there is indeed some curvature in the data, seen here as a convex shape - starting high, then going down before rising again. The still relatively low r² (compared to 0.625 for breast cancer and internet usage) leads me to believe that I could do better with a different model, but this should suffice for now.
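Where the fitted curve stops falling and starts rising follows directly from the two coefficients. The Python sketch below uses the rounded values from the summary above and the mean of breastcancerper100th from the describe() output, so the results are approximate.

```python
# coefficients copied from the polynomial OLS summary (rounded values)
intercept = 42.9165
linear = -0.2665
quadratic = 0.0094
breast_mean = 37.55  # mean of breastcancerper100th from describe()

# a parabola a + b*x + c*x^2 has its turning point at x = -b / (2c)
turning_point_centred = -linear / (2 * quadratic)
turning_point_raw = turning_point_centred + breast_mean

# predicted female employment rate at the turning point
minimum_employment = (intercept
                      + linear * turning_point_centred
                      + quadratic * turning_point_centred ** 2)

print(f"turning point (centred): {turning_point_centred:.1f}")
print(f"turning point (cases per 100,000): {turning_point_raw:.1f}")
print(f"predicted employment rate there: {minimum_employment:.1f}")
```

According to this model, predicted female employment falls until roughly 52 new breast cancer cases per 100,000 females, bottoms out at about 41%, and rises again beyond that.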

At the bottom of the results there is again a warning about strong multicollinearity, which I've also seen in the previous post about multiple regression. In this case, this is definitely expected, since I've only added a variable that's already in the model, only this time it's squared. Of course the squared variable is correlated to the original, and there's nothing more that can be done about that. Apparently, centring the explanatory variable reduces the correlation between linear and quadratic variables, and that I have already done.
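The effect of centring on the correlation between a variable and its square can be checked with a few lines of Python. The values are made up and perfectly symmetric around their mean, which is the best case; for skewed data like the breast cancer rates, centring reduces the correlation but does not remove it.

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

# made-up, all-positive explanatory variable
x = [float(v) for v in range(1, 11)]
x_mean = sum(x) / len(x)
x_centred = [v - x_mean for v in x]

corr_raw = pearson(x, [v ** 2 for v in x])
corr_centred = pearson(x_centred, [v ** 2 for v in x_centred])

print(f"raw:     {corr_raw:.3f}")
print(f"centred: {corr_centred:.3f}")
```

Without centring, the variable and its square are almost perfectly correlated; after centring, negative and positive values both produce positive squares, and the linear correlation largely cancels out.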

#RegModPrac #BCCIU #python #polynomial regression

Multiple Linear Regression - Python

Welcome to week three of Regression Modelling in Practice! I will write this step in the Breast Cancer Causes Internet Usage! (BCCIU) project in two parts:

The first part (which is this) will apply a multiple regression model to analyse the association of one of my response variables (internet users per 100 people in 2010) with my primary explanatory variable (new breast cancer cases per 100,000 females in 2002) and additional variables (my previously used moderator income per person in 2010, and the percentage of urban populations in 2008). This is also the part of the project that will be graded on coursera.

The second part will make use of another modelling technique we learned about this week: polynomial regression. This will allow me to better analyse the association between my second response variable (percentage of employed females in 2007) with breast cancer as explanatory variable, since I've already seen that linear regression doesn't work on these data.

As before, the output will look lots better in the nbviewer than on tumblr.

Preparation

First up comes the code to prepare the raw data, filtering for the country identifiers and breast cancer, internet usage, income, and urbanisation.

# activate inline plotting, should be first statement
%matplotlib inline

# load packages
import warnings # ignore warnings (e.g. from future, deprecation, etc.)
warnings.filterwarnings('ignore') # for layout reasons, after I read and acknowledged them all!
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

# read in data
data = pandas.read_csv("../gapminder.csv", low_memory=False)

# use country names as row names/indices for plotting purposes
data.index = data["country"]
# drop() returns a new data frame, so the result has to be assigned
data = data.drop("country", axis=1)

# subset the data and make a copy to avoid error messages later on
sub = data[["breastcancerper100th", "incomeperperson", "internetuserate", "urbanrate"]]
sub_data = sub.copy()

# change data types to numeric
sub_data["breastcancerper100th"] = pandas.to_numeric(sub_data["breastcancerper100th"], errors="coerce")
sub_data["incomeperperson"] = pandas.to_numeric(sub_data["incomeperperson"], errors="coerce")
sub_data["internetuserate"] = pandas.to_numeric(sub_data["internetuserate"], errors="coerce")
sub_data["urbanrate"] = pandas.to_numeric(sub_data["urbanrate"], errors="coerce")

# remove rows with missing values (copy again)
sub2 = sub_data.dropna()
sub_data2 = sub2.copy()

The explanatory variables (all but internet usage) should be mean centred for easier interpretation. To do this, I have to subtract each variable's mean from every single observation for the three variables in question: breast cancer, income, and urbanisation.

# take breast cancer case column and subtract mean
sub_data2[["breastCentred"]] = sub_data2[["breastcancerper100th"]] - sub_data2[["breastcancerper100th"]].mean()

# take income per person column and subtract mean
sub_data2[["incomeCentred"]] = sub_data2[["incomeperperson"]] - sub_data2[["incomeperperson"]].mean()

# take urbanisation column and subtract mean
sub_data2[["urbanCentred"]] = sub_data2[["urbanrate"]] - sub_data2[["urbanrate"]].mean()

# examine data summary
print("data after centring")
print(sub_data2.describe())

data after centring
       breastcancerper100th  incomeperperson  internetuserate   urbanrate  \
count            163.000000       163.000000       163.000000  163.000000
mean              37.781595      7312.376683        33.747359   56.245767
std               23.122332     10467.625388        27.868070   22.943194
min                3.900000       103.775857         0.720009   10.400000
25%               20.600000       691.093623         9.102256   36.840000
50%               30.300000      2425.471293        28.731883   59.460000
75%               50.350000      8880.432040        52.513403   73.490000
max              101.100000     52301.587179        95.638113  100.000000

       breastCentred  incomeCentred  urbanCentred
count   1.630000e+02   1.630000e+02  1.630000e+02
mean   -1.299029e-14  -2.081236e-12  2.353945e-15
std     2.312233e+01   1.046763e+04  2.294319e+01
min    -3.388160e+01  -7.208601e+03 -4.584577e+01
25%    -1.718160e+01  -6.621283e+03 -1.940577e+01
50%    -7.481595e+00  -4.886905e+03  3.214233e+00
75%     1.256840e+01   1.568055e+03  1.724423e+01
max     6.331840e+01   4.498921e+04  4.375423e+01

The describe() function again does not return means of exactly zero for the centred explanatory variables, but the values are very close to zero. This is a consequence of how floating-point numbers are represented, as explained in the Python tutorial.

Regression Models

I'll start again with the basic linear model so that I can compare this to the multiple regression models.

# regression model for breast cancer and internet usage
print("OLS regression model for the association between breast cancer cases and internet use rate")
reg1 = smf.ols("internetuserate ~ breastCentred", data=sub_data2).fit()
print(reg1.summary())

OLS regression model for the association between breast cancer cases and internet use rate
                            OLS Regression Results
==============================================================================
Dep. Variable:        internetuserate   R-squared:                       0.625
Model:                            OLS   Adj. R-squared:                  0.622
Method:                 Least Squares   F-statistic:                     268.1
Date:                Mon, 25 Jul 2016   Prob (F-statistic):           4.26e-36
Time:                        17:08:58   Log-Likelihood:                -693.28
No. Observations:                 163   AIC:                             1391.
Df Residuals:                     161   BIC:                             1397.
Df Model:                           1
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
Intercept        33.7474      1.341     25.161      0.000        31.099    36.396
breastCentred     0.9527      0.058     16.373      0.000         0.838     1.068
==============================================================================
Omnibus:                       34.123   Durbin-Watson:                   1.945
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               54.169
Skew:                           1.077   Prob(JB):                     1.73e-12
Kurtosis:                       4.827   Cond. No.                         23.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

For the association of internet use rates and breast cancer cases per 100,000 people, the r² value (the "coefficient of determination") at the top right of the OLS regression results is 0.625, which we already saw when calculating Pearson's correlation coefficient. It means that the variability in the occurrence of new breast cancer cases can explain 62% of the variability in internet usage.
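To make the link between r² and Pearson's correlation concrete, here is a minimal sketch in plain Python. The five data points are made up for illustration (they are not the Gapminder values), and the helper names are my own. For a regression with a single explanatory variable, the r² of the fitted line equals the square of Pearson's r.

```python
# Toy data (hypothetical values, not taken from the Gapminder set)
x = [3.9, 20.6, 30.3, 50.4, 101.1]
y = [2.0, 9.1, 28.7, 52.5, 80.0]


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def r_squared(x, y):
    """1 - SS_residual / SS_total for the least-squares line through (x, y)."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot


r = pearson_r(x, y)
r2 = r_squared(x, y)
print(round(r ** 2, 6), round(r2, 6))  # the two numbers agree
```

Read the other way round, an r² of 0.625 corresponds to a correlation coefficient of about 0.79 (the square root of 0.625).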

Below this first part of the results, parameter estimates are presented for the intercept (the predicted internet usage at the mean number of breast cancer cases, since the explanatory variable is centred) and for breast cancer cases. The coefficient for breast cancer is positive, meaning that there is a positive association between breast cancer and internet usage. The low p-value and the narrow confidence interval indicate that this association is highly significant.

Could this association be confounded by another variable? The income per person could be a likely indicator of how many people have access to the internet in a country. Admittedly, I've used this as moderator for a Pearson correlation before and the primary association was still significant, but it can't hurt to test that again with a multiple regression model.

# regression model for breast cancer and income with internet usage
print("OLS regression model for the association between breast cancer cases, income,\nand internet use rate")
reg2 = smf.ols("internetuserate ~ breastCentred + incomeCentred", data=sub_data2).fit()
print(reg2.summary())

OLS regression model for the association between breast cancer cases, income,
and internet use rate
                            OLS Regression Results
==============================================================================
Dep. Variable:        internetuserate   R-squared:                       0.739
Model:                            OLS   Adj. R-squared:                  0.736
Method:                 Least Squares   F-statistic:                     226.3
Date:                Mon, 25 Jul 2016   Prob (F-statistic):           2.26e-47
Time:                        17:08:58   Log-Likelihood:                -663.74
No. Observations:                 163   AIC:                             1333.
Df Residuals:                     160   BIC:                             1343.
Df Model:                           2
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
Intercept        33.7474      1.122     30.066      0.000        31.531    35.964
breastCentred     0.5174      0.071      7.258      0.000         0.377     0.658
incomeCentred     0.0013      0.000      8.359      0.000         0.001     0.002
==============================================================================
Omnibus:                       15.040   Durbin-Watson:                   2.155
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               16.391
Skew:                           0.753   Prob(JB):                     0.000276
Kurtosis:                       3.380   Cond. No.                     1.04e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.04e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

The r² value for the model increased from 0.625 to 0.739, indicating that even more variability in internet usage can now be explained. Additionally, both coefficients and p-values indicate a highly significant positive association between internet usage and breast cancer or income.

On the other hand, the warnings below the model summary point out a flaw in the model: there could be strong multicollinearity - which means the explanatory variables used could be correlated (e.g. if in countries with higher per person income, more breast cancer cases are detected). This can lead to some instability in the parameter estimation, but it is difficult to predict the exact effects of multicollinearity. In this case, both variables are still significantly associated with my response variable and the coefficients fall nicely into quite small confidence intervals. I'm satisfied with that and will simply add another explanatory variable (urbanisation) to the model.
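One simple way to look at multicollinearity between two explanatory variables is to correlate them with each other. The sketch below uses made-up numbers for five hypothetical countries (not the Gapminder values) and computes the variance inflation factor, which for exactly two explanatory variables reduces to 1 / (1 - r²). A large condition number can also arise simply because variables are measured on very different scales, so this check is only one piece of the picture.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical values for five countries (illustration only)
breast = [10.0, 20.0, 30.0, 50.0, 100.0]
income = [500.0, 700.0, 2400.0, 8900.0, 52000.0]

# Correlation between the two explanatory variables
r = pearson_r(breast, income)

# With exactly two explanatory variables, each one's VIF is 1 / (1 - r^2);
# values far above 1 mean the coefficient estimates become less stable
vif = 1.0 / (1.0 - r ** 2)
print(round(r, 3), round(vif, 1))
```

A VIF of 1 means the explanatory variables are uncorrelated; the closer r gets to 1 or -1, the larger the VIF becomes.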

# regression model for breast cancer, income, and urbanisation with internet usage
print("OLS regression model for the association between breast cancer cases, income,\nurbanisation, and internet use rate")
reg3 = smf.ols("internetuserate ~ breastCentred + incomeCentred + urbanCentred", data=sub_data2).fit()
print(reg3.summary())

OLS regression model for the association between breast cancer cases, income,
urbanisation, and internet use rate
                            OLS Regression Results
==============================================================================
Dep. Variable:        internetuserate   R-squared:                       0.767
Model:                            OLS   Adj. R-squared:                  0.763
Method:                 Least Squares   F-statistic:                     174.4
Date:                Mon, 25 Jul 2016   Prob (F-statistic):           4.60e-50
Time:                        17:08:58   Log-Likelihood:                -654.47
No. Observations:                 163   AIC:                             1317.
Df Residuals:                     159   BIC:                             1329.
Df Model:                           3
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
Intercept        33.7474      1.064     31.727      0.000        31.647    35.848
breastCentred     0.4350      0.070      6.203      0.000         0.297     0.574
incomeCentred     0.0011      0.000      7.145      0.000         0.001     0.001
urbanCentred      0.2605      0.060      4.378      0.000         0.143     0.378
==============================================================================
Omnibus:                       11.149   Durbin-Watson:                   2.117
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               11.470
Skew:                           0.631   Prob(JB):                      0.00323
Kurtosis:                       3.312   Cond. No.                     1.04e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.04e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Of course, the warning is still there, and not much has changed about the model. The r² increased a bit again, and the urbanisation rate is also significantly and positively associated with internet usage.

If one of these variables were a true confounder, the association between the primary explanatory variable (breast cancer) and internet usage would not be significant any more. As it is, all three explanatory variables tested seem to be associated with internet usage without confounding each other much. Only a decrease in the parameter estimates indicates a "less positive" association of the single explanatory variables with the response variable.

Diagnostics

Different diagnostic plots can be used to assess the validity of and possible problems in the model. A quantile-quantile plot (or Q-Q plot) can be used to check whether or not the residuals[1] of the regression model are normally distributed. The Q-Q plot plots the quantiles of a normal distribution against the quantiles of the residuals. If the points follow a straight line, the distributions are similar.

[1]: A residual is the difference between the actual and the predicted value (based on the regression model) of the response variable.

qqfig = sm.qqplot(reg3.resid, line='r')

In the plot above, the red line marks the perfect normal distribution, and most of the points lie close to it, indicating that the residuals are almost normally distributed. Especially at both ends of the line, the quantiles deviate from the norm, though. This could mean that the regression model doesn't fully capture the association between internet usage and the explanatory variables (breast cancer, income, and urban rate). Other factors could be involved as well, it seems.
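For readers who want to see what goes into such a plot, here is a minimal sketch of the points a Q-Q plot is built from, using only Python's standard library. The residuals are made up, and the plotting positions (i + 0.5) / n are just one common convention; I am not claiming that statsmodels uses exactly this one.

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical residuals (illustration only, not from reg3)
residuals = [-12.0, -5.5, -3.0, -1.0, 0.5, 2.0, 4.0, 6.0, 9.0]


def qq_points(values):
    """Pair theoretical normal quantiles with sorted standardised values."""
    n = len(values)
    m, s = mean(values), pstdev(values)
    # sample quantiles: standardised values in ascending order
    sample = sorted((v - m) / s for v in values)
    # theoretical quantiles of the standard normal at positions (i + 0.5) / n
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sample))


points = qq_points(residuals)
for theo, samp in points:
    print(round(theo, 3), round(samp, 3))
```

If the residuals were perfectly normal, the two numbers in each pair would be nearly equal, which is what "lying on the line" means in the plot.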

This can also be further analysed with the following plot.

stdres = pandas.DataFrame(reg3.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number');

The second plot shows the standardised residuals, which are the residuals minus their mean divided by the standard deviation. This normalisation yields a set of values with a mean of zero and a standard deviation of one. In a standard normal distribution, 68% of the observations are expected to fall within one standard deviation of the mean, and 95% of the observations are expected to fall within two standard deviations of the mean.

In the plot above, this seems to be roughly the case, since many of the plotted values lie between -1 and 1. No standardised residual is smaller than -2, but some lie between 2 and 3 (so more than two standard deviations above the mean). Out of 163 values, ten or eleven are greater than or equal to two, which is roughly 6 to 7%, a little more than the 5% expected under a normal distribution. This suggests a somewhat poor model fit and that an important explanatory variable could be missing.
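The counting described above can be sketched in a few lines of plain Python. The helper names are my own, and the standardisation here simply divides by the population standard deviation; statsmodels' resid_pearson may use a slightly different denominator, so treat this as an approximation of the idea.

```python
from statistics import mean, pstdev


def standardise(residuals):
    """Subtract the mean and divide by the standard deviation."""
    m, s = mean(residuals), pstdev(residuals)
    return [(r - m) / s for r in residuals]


def share_beyond(values, threshold=2.0):
    """Fraction of values whose absolute value is at least `threshold`."""
    return sum(1 for v in values if abs(v) >= threshold) / len(values)


# Hypothetical residuals (illustration only)
z = standardise([-12.0, -5.5, -3.0, -1.0, 0.5, 2.0, 4.0, 6.0, 9.0])
print(share_beyond(z))

# The shares for ten or eleven out of 163 observations
share_ten = 10 / 163
share_eleven = 11 / 163
print(round(share_ten * 100, 1), round(share_eleven * 100, 1))
```

Ten out of 163 is about 6.1% and eleven out of 163 is about 6.7%, compared with the 5% expected beyond two standard deviations under a normal distribution.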

levfig = sm.graphics.influence_plot(reg3, size=8)

The third plot is a leverage, or influence, plot. It visualises how much influence single observations have on the model. The studentised residuals plotted here are the residuals divided by their standard deviation, without subtracting the mean first. In this approach, observations with values greater than 2 or smaller than -2 are considered outliers. The leverage indicates how much the predicted values would change if a single observation were removed, and can take values between 0 and 1.

In this case, there are quite a number of outliers at the top of the plot, but they have very low leverage and therefore not much influence on the model. Two other observations, from Japan and Luxembourg, stand out with the highest leverage observed here. They seem to influence the model more than the rest, but since they are not outliers, that's probably not a big problem.
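To give a feeling for where leverage values come from, here is a sketch for the special case of a regression with one explanatory variable, where the leverage of observation i is 1/n + (x_i - mean)² / sum of squared deviations. The model above has three explanatory variables, so this is a simplification, and the income values are made up.

```python
def leverages(x):
    """Leverage of each observation in a one-predictor regression with intercept."""
    n = len(x)
    mx = sum(x) / n
    sxx = sum((v - mx) ** 2 for v in x)
    return [1.0 / n + (v - mx) ** 2 / sxx for v in x]


# Hypothetical income values; the last one is far from the rest
income = [500.0, 700.0, 2400.0, 8900.0, 52000.0]
h = leverages(income)
for value, lev in zip(income, h):
    print(value, round(lev, 3))
```

The observation with the most unusual explanatory value gets the highest leverage, which fits the picture of countries with extreme values standing out on the x axis of the influence plot. In this special case, the leverages always sum to two, the number of estimated parameters.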

#RegModPrac #BCCIU #multiple regression #diagnostics #python

Basic Linear Regression - R

As mentioned before, I want to compare Python and R analysis steps in the DataManViz, DataAnaT, and now RegModPrac projects. Therefore, this is the R version of the Basic Linear Regression Python script I posted a few days ago. Again, the whole thing will look better over here.

I will first run some of my previous code to remove variables I don't need and observations for which important data is missing.

# load libraries
library(ggplot2)

# load data
gapminder <- read.table("../gapminder.csv", sep = ",", header = TRUE, quote = "\"")

# subset data
sub_data <- subset(gapminder, select = c("country", "breastcancerper100th", "femaleemployrate", "internetuserate"))

# remove rows with NAs
sub_data2 <- na.omit(sub_data)

The course also required me to centre my explanatory variable, new breast cancer cases per 100,000 females in 2002. This way, the regression model will be easier to interpret. In R, I usually use scale() for such operations.

# centre breast cancer data
sub_data2$breastCentre <- scale(sub_data2$breastcancerper100th, scale = FALSE)
summary(sub_data2)

   country          breastcancerper100th femaleemployrate internetuserate
 Length:162         Min.   :  3.90       Min.   :12.40    Min.   : 0.720
 Class :character   1st Qu.: 20.73       1st Qu.:38.90    1st Qu.: 9.637
 Mode  :character   Median : 30.45       Median :47.80    Median :29.440
                    Mean   : 37.90       Mean   :47.73    Mean   :34.082
                    3rd Qu.: 50.38       3rd Qu.:55.88    3rd Qu.:52.769
                    Max.   :101.10       Max.   :83.30    Max.   :95.638
  breastCentre.V1
 Min.   :-33.99691
 1st Qu.:-17.17191
 Median : -7.44691
 Mean   :  0.00000
 3rd Qu.: 12.47809
 Max.   : 63.20309

The mean of the original breast cancer data is 37.8969136, while the mean of the centred variable is -3.071103e-15, or, if rounded, 0. Nice to see that R has the same problems as Python when dealing with floats!

Scatterplots including a linear regression line visualise what I'm going to calculate:

# scatterplot for breast cancer versus internet usage
ggplot(sub_data2, aes(x = breastCentre, y = internetuserate)) +
  geom_point(colour = "blue") +
  geom_smooth(method = lm) +
  xlab("breast cancer cases 2002") +
  ylab("internet use rate 2010") +
  ggtitle("Scatterplot for the Association between\nBreast Cancer and Internet Usage") +
  theme(plot.title = element_text(size = 10),
        axis.title.x = element_text(size = 8),
        axis.title.y = element_text(size = 8))

# scatterplot for breast cancer versus female employment
ggplot(sub_data2, aes(x = breastCentre, y = femaleemployrate)) +
  geom_point(colour = "blue") +
  geom_smooth(method = lm) +
  xlab("breast cancer cases 2002") +
  ylab("female employ rate 2007") +
  ggtitle("Scatterplot for the Association between\nBreast Cancer and Female Employment") +
  theme(plot.title = element_text(size = 10),
        axis.title.x = element_text(size = 8),
        axis.title.y = element_text(size = 8))

While there seems to be a linear relationship between breast cancer and internet usage, this cannot be said for breast cancer and female employment. This relationship seems to be more complex, which I've discussed before.

The linear regression function lm() (whose name, unlike the Python one, does not mention OLS) works essentially the same way as the Python function did: you enter the response variable, a tilde, the explanatory variable, and the name of the data set from which to take these.

fit1 <- lm(internetuserate ~ breastCentre, data = sub_data2)
summary(fit1)

Call:
lm(formula = internetuserate ~ breastCentre, data = sub_data2)

Residuals:
    Min      1Q  Median      3Q     Max
-32.260 -11.796  -1.266   8.260  65.044

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   34.0820     1.3450   25.34   <2e-16 ***
breastCentre   0.9493     0.0583   16.28   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.12 on 160 degrees of freedom
Multiple R-squared:  0.6237, Adjusted R-squared:  0.6213
F-statistic: 265.2 on 1 and 160 DF,  p-value: < 2.2e-16

The result is less structured, but also essentially the same: we get the coefficients, the r² value, the F-statistic, and a p-value. The actual values are identical to the results from the Python script, showing that there is a strong linear association between new breast cancer cases per 100,000 in 2002 and the internet use rate from 2010. The regression formula would again be internet usage = 34.08 + 0.95 * breast cancer, indicating that if the breast cancer variable is increased by one, the internet use rate would be increased almost as much.

fit2 <- lm(femaleemployrate ~ breastCentre, data = sub_data2)
summary(fit2)

Call:
lm(formula = femaleemployrate ~ breastCentre, data = sub_data2)

Residuals:
    Min      1Q  Median      3Q     Max
-35.599  -9.206   0.590   8.591  34.774

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  47.73086    1.15870  41.194   <2e-16 ***
breastCentre -0.04325    0.05022  -0.861     0.39
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.75 on 160 degrees of freedom
Multiple R-squared:  0.004613, Adjusted R-squared:  -0.001608
F-statistic: 0.7416 on 1 and 160 DF,  p-value: 0.3905

The results for the association between female employment from 2007 with breast cancer are also the same as before, which means that there is no linear relationship between the two variables. This also means that R's lm() function essentially calculates an ordinary least squares regression.

#RegModPrac #BCCIU #linear regression #R

Basic Linear Regression - Python

Welcome back (once more)! I handed in my PhD thesis last week, so now I should finally have time for the next course in the Data Analysis and Interpretation specialisation: Regression Modelling in Practice. I've already described my data for the first course week, and now it's time for basic linear regression in the BCCIU project!

Here's a short recap of my Breast Cancer Causes Internet Usage! (BCCIU) project: I chose to see if there is a relationship between breast cancer and internet usage or female employment, respectively, based on the reduced Gapminder data set provided in the Coursera course. The problem with this data is that the variables were all obtained in different years: breast cancer cases per 100,000 females were counted in 2002, while the female employment rate (as % of the female population aged 15 and above) was calculated for 2007, and internet users per 100 people were counted in 2010. This is why I'm not saying that internet usage causes breast cancer -- rather, I'm evaluating if the new breast cancer cases from 2002 influenced internet users and female employment in later years.

So far, in the Data Analysis and Interpretation course series, we learned how to prepare, manage, and visualise data in Python, before moving on to data analysis. This included ANOVA (comparing the means of different groups of the explanatory variable), the Chi-squared test (comparing categorical variables), and Pearson correlation for numerical data, with and without moderators.

This time, we'll use basic linear regression, for which I'll have to centre my quantitative explanatory variable (new breast cancer cases per 100,000 females in 2002). I will then test a linear regression model for each response variable (female employment rate in 2007 and internet usage in 2010) and evaluate the regression coefficients and p-values.

As before, the output will look lots better in the nbviewer than on tumblr.

First up comes the code to prepare the raw data, filtering for the country identifiers and breast cancer, female employment, and internet usage.

# activate inline plotting, should be first statement
%matplotlib inline

# load packages
import warnings # ignore warnings (e.g. from future, deprecation, etc.)
warnings.filterwarnings('ignore') # for layout reasons, after I read and acknowledged them all!
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import statsmodels.api
import statsmodels.formula.api as smf

# read in data
data = pandas.read_csv("../gapminder.csv", low_memory=False)

# subset the data and make a copy to avoid error messages later on
sub = data[["country", "breastcancerper100th", "femaleemployrate", "internetuserate"]]
sub_data = sub.copy()

# change data types to numeric
sub_data["breastcancerper100th"] = pandas.to_numeric(sub_data["breastcancerper100th"], errors="coerce")
sub_data["femaleemployrate"] = pandas.to_numeric(sub_data["femaleemployrate"], errors="coerce")
sub_data["internetuserate"] = pandas.to_numeric(sub_data["internetuserate"], errors="coerce")

# remove rows with missing values (copy again)
sub2 = sub_data.dropna()
sub_data2 = sub2.copy()

To centre the explanatory variable, I have to subtract the mean of that variable from every data point. This should only be done for the breast cancer column in the data, though.

# examine data summary before centering explanatory variable
print("data frame before centering breast cancer cases")
print(sub_data2.describe())
# copy the subsetted data
sub_data3 = sub_data2.copy()
# calculate the mean for breast cancer
sub2_mean = sub_data2[["breastcancerper100th"]].mean()
# take breast cancer case column and subtract mean, replacing original data in data frame
sub_data3[["breastcancerper100th"]] = sub_data2[["breastcancerper100th"]] - sub2_mean
# examine data summary
print("\ndata frame after centering breast cancer cases")
print(sub_data3.describe())

data frame before centering breast cancer cases
       breastcancerper100th  femaleemployrate  internetuserate
count            162.000000        162.000000       162.000000
mean              37.896914         47.730864        34.081991
std               23.142723         14.735980        27.819118
min                3.900000         12.400000         0.720009
25%               20.725000         38.900000         9.637458
50%               30.450000         47.799999        29.439699
75%               50.375000         55.875000        52.769074
max              101.100000         83.300003        95.638113

data frame after centering breast cancer cases
       breastcancerper100th  femaleemployrate  internetuserate
count          1.620000e+02        162.000000       162.000000
mean          -1.679863e-14         47.730864        34.081991
std            2.314272e+01         14.735980        27.819118
min           -3.399691e+01         12.400000         0.720009
25%           -1.717191e+01         38.900000         9.637458
50%           -7.446914e+00         47.799999        29.439699
75%            1.247809e+01         55.875000        52.769074
max            6.320309e+01         83.300003        95.638113

The describe() function does not return a mean of exactly zero after centring the breast cancer cases, but the value is very close to zero. This is a consequence of floating-point representation (which is not specific to Python), as explained in the Python tutorial, and, so far, I couldn't find a solution for it.
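One possible way to deal with such tiny leftovers is not to compare against exactly zero, but to check for "close enough" or to round for display. The sketch below uses only the standard library and a handful of values of my own choosing; the tolerance and the number of decimals are arbitrary assumptions. In pandas, rounding the summary with something like sub_data3.describe().round(6) should hide the effect in a similar way.

```python
import math

# A few example values (hypothetical subset, for illustration only)
values = [3.9, 20.6, 30.3, 50.35, 101.1]

# centre the values by subtracting their mean
m = sum(values) / len(values)
centred = [v - m for v in values]

# the mean of the centred values may be a tiny non-zero number
centred_mean = sum(centred) / len(centred)
print(centred_mean)

# option 1: test for "zero within a tolerance" instead of exact equality
is_zero = math.isclose(centred_mean, 0.0, abs_tol=1e-9)

# option 2: round before displaying
rounded = round(centred_mean, 10)
print(is_zero, rounded)
```

Note that abs_tol is needed here, because math.isclose() uses a relative tolerance by default, which never considers a non-zero number close to zero.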

Scatterplots for the two relationships in question follow below, including fitted regression lines, courtesy of seaborn's regplot() function. They show that around half of the measurements on the x axis are negative now, due to the centring of the explanatory variable.

# plot bivariate scatterplots
fig = plt.figure(figsize=(17,5))

fig.add_subplot(121)
seaborn.regplot(x="breastcancerper100th", y="internetuserate", fit_reg=True, data=sub_data3);
plt.xlabel("breast cancer cases 2002");
plt.ylabel("internet use rates 2010");
plt.title("Scatterplot for the Association between Breast Cancer and Internet Usage");

fig.add_subplot(122)
seaborn.regplot(x="breastcancerper100th", y="femaleemployrate", fit_reg=True, data=sub_data3);
plt.xlabel("breast cancer cases 2002");
plt.ylabel("female employ rates 2007");
plt.title("Scatterplot for the Association between Breast Cancer and Female Employment");

fig.tight_layout()
plt.show()

The left scatterplot shows a nice linear relationship between breast cancer and internet usage: in countries with more breast cancer cases (in 2002), more people used the internet in 2010. The relationship between breast cancer and female employment, on the other hand, is not linear: while there is a high female employment rate in countries with only few breast cancer cases, and then a drop in employment when the breast cancer prevalence increases, countries with many breast cancer cases again show a higher female employment rate.

The regression lines shown in the plots above can be modelled with statsmodels.formula.api's ols() (ordinary least squares) function, which we already used in the last course to do an analysis of variance on our data.

# regression model for breast cancer and internet usage
print("OLS regression model for the association between breast cancer cases and internet use rate")
reg1 = smf.ols("internetuserate ~ breastcancerper100th", data=sub_data3).fit()
print(reg1.summary())

OLS regression model for the association between breast cancer cases and internet use rate
                            OLS Regression Results
==============================================================================
Dep. Variable:        internetuserate   R-squared:                       0.624
Model:                            OLS   Adj. R-squared:                  0.621
Method:                 Least Squares   F-statistic:                     265.2
Date:                Thu, 14 Jul 2016   Prob (F-statistic):           8.79e-36
Time:                        16:49:23   Log-Likelihood:                -688.97
No. Observations:                 162   AIC:                             1382.
Df Residuals:                     160   BIC:                             1388.
Df Model:                           1
Covariance Type:            nonrobust
========================================================================================
                           coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------
Intercept               34.0820      1.345     25.340      0.000        31.426    36.738
breastcancerper100th     0.9493      0.058     16.284      0.000         0.834     1.064
==============================================================================
Omnibus:                       32.936   Durbin-Watson:                   1.856
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               51.444
Skew:                           1.055   Prob(JB):                     6.75e-12
Kurtosis:                       4.781   Cond. No.                         23.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Summaries of OLS regression models return many statistical values. For the association of internet use rates and breast cancer cases per 100,000 people, the r² value (the "coefficient of determination") is 0.624, which we already saw when calculating Pearson's correlation coefficient. It means that the variability in the occurrence of new breast cancer cases can explain 62% of the variability in internet usage. The F-statistic value is very high (265.2), showing that the variance explained by the model is a lot higher than the residual variance it leaves unexplained. Accordingly, the probability based on this value is very low, so we can reject the null hypothesis of no association between breast cancer and internet usage (p < 0.0001).

Below this first part of the results, parameter estimates and other statistical values are presented for the intercept (the predicted internet usage at the mean number of breast cancer cases, since the explanatory variable is centred) and for breast cancer cases. The values from the "coef" column can be plugged into the linear regression formula: internet usage = 34.08 + 0.95 * new breast cancer prevalence (centred), which could theoretically be used to predict new values. This formula also shows that, if the breast cancer prevalence increases by one, the internet use rate will (possibly) also increase by almost one.
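As a rough illustration of what the fitted formula does, the sketch below fits a least-squares line by hand on made-up data and then plugs numbers into the formula from the summary. The toy data and function names are hypothetical; only the rounded coefficients 34.08 and 0.95 are taken from the output above. It also shows why centring is convenient: the slope stays the same, while the intercept becomes the mean of the response variable.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return intercept, slope


# Toy data (hypothetical values, not the Gapminder data)
x_raw = [3.9, 20.6, 30.3, 50.4, 101.1]
y = [2.0, 9.1, 28.7, 52.5, 80.0]

# centre the explanatory variable
x_mean = sum(x_raw) / len(x_raw)
x_centred = [v - x_mean for v in x_raw]

b0_raw, b1_raw = fit_line(x_raw, y)
b0_centred, b1_centred = fit_line(x_centred, y)
print(b1_raw, b1_centred)          # same slope
print(b0_centred, sum(y) / len(y)) # intercept equals the mean of y


def predict_internet(breast_centred, intercept=34.08, slope=0.95):
    """Plug a centred breast cancer value into the formula from the summary."""
    return intercept + slope * breast_centred


# a country with 10 more cases per 100,000 than the average country
print(predict_internet(10.0))
```

With the rounded coefficients, a country with ten more cases than average would be predicted to have an internet use rate of about 43.6, compared with 34.08 for a country at the average.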

# regression model for breast cancer and female employment
print("\nOLS regression model for the association between breast cancer cases and female employment rate")
reg2 = smf.ols("femaleemployrate ~ breastcancerper100th", data=sub_data3).fit()
print(reg2.summary())

OLS regression model for the association between breast cancer cases and female employment rate
                            OLS Regression Results
==============================================================================
Dep. Variable:       femaleemployrate   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                 -0.002
Method:                 Least Squares   F-statistic:                    0.7416
Date:                Thu, 14 Jul 2016   Prob (F-statistic):              0.390
Time:                        16:49:23   Log-Likelihood:                -664.82
No. Observations:                 162   AIC:                             1334.
Df Residuals:                     160   BIC:                             1340.
Df Model:                           1
Covariance Type:            nonrobust
========================================================================================
                           coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------
Intercept               47.7309      1.159     41.194      0.000        45.443    50.019
breastcancerper100th    -0.0432      0.050     -0.861      0.390        -0.142     0.056
==============================================================================
Omnibus:                        0.024   Durbin-Watson:                   1.933
Prob(Omnibus):                  0.988   Jarque-Bera (JB):                0.012
Skew:                          -0.007   Prob(JB):                        0.994
Kurtosis:                       2.960   Cond. No.                         23.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

As expected, judging from the scatterplot (or all my old results), the r² and the F-statistic values are very low (r² = 0.005, F = 0.74). This means that the variance explained by the model is even smaller than the residual variance, and the variability of the breast cancer cases can explain almost none of the variability of the female employment rate. Accordingly, the p-value is relatively high (p = 0.39). Thus, while I could create a regression model formula for this association, using again the "coef" values, I don't think it would be useful.
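The connection between r² and the F-statistic can be checked with a little arithmetic: F is the explained variance per model degree of freedom divided by the unexplained variance per residual degree of freedom. The sketch below uses the less rounded r² values from R's summary of the same two models (0.6237 and 0.004613) together with the 162 observations; the function name is my own.

```python
def f_from_r2(r2, n_obs, n_predictors=1):
    """F-statistic of a linear model computed from its r² value."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    return (r2 / df_model) / ((1 - r2) / df_resid)


# internet usage ~ breast cancer
f_internet = f_from_r2(0.6237, 162)
# female employment ~ breast cancer
f_employ = f_from_r2(0.004613, 162)

print(round(f_internet, 1), round(f_employ, 2))
```

Up to rounding of the r² values, this reproduces the F-statistics of about 265 and 0.74 shown in the two summaries.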

#RegModPrac #BCCIU #linear regression #python

Multiple Linear Regression - Python

Welcome to week three of Regression Modelling in Practice! I will write this step in the Breast Cancer Causes Internet Usage! (BCCIU) project in two parts:

As before, the output will look lots better in the nbviewer than on tumblr.

Preparation

First up comes the code to prepare the raw data, filtering for the country identifiers and breast cancer, internet usage, income, and urbanisation.

# activate inline plotting, should be first statement
%matplotlib inline

# load packages
import warnings # ignore warnings (e.g. from future, deprecation, etc.)
warnings.filterwarnings('ignore') # for layout reasons, after I read and acknowledged them all!
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

# read in data
data = pandas.read_csv("../gapminder.csv", low_memory=False)

# use country names as row names/indices for plotting purposes
data.index = data["country"]
# drop() returns a new data frame, so the result has to be assigned
data = data.drop("country", axis=1)

# subset the data and make a copy to avoid error messages later on
sub = data[["breastcancerper100th", "incomeperperson", "internetuserate", "urbanrate"]]
sub_data = sub.copy()

# change data types to numeric
sub_data["breastcancerper100th"] = pandas.to_numeric(sub_data["breastcancerper100th"], errors="coerce")
sub_data["incomeperperson"] = pandas.to_numeric(sub_data["incomeperperson"], errors="coerce")
sub_data["internetuserate"] = pandas.to_numeric(sub_data["internetuserate"], errors="coerce")
sub_data["urbanrate"] = pandas.to_numeric(sub_data["urbanrate"], errors="coerce")

# remove rows with missing values (copy again)
sub2 = sub_data.dropna()
sub_data2 = sub2.copy()

