Regression Modeling in Practice - Assign3: Test a Multiple Regression Model
Hypothesis
For this assignment I look into the following research question: is an individual's amount of cigarette smoking (nicotine dependence) associated with the age at which the individual smoked their very first cigarette? Nicotine dependence could also be associated with the introductory age to other substances (e.g., first use of marijuana, cocaine, or alcohol).
Identifying the explanatory variables that could affect nicotine dependence:
Introductory age to cigarettes and introductory age to marijuana each have a negative parameter estimate with a significant p-value.
Introductory age to illegal drugs has a positive parameter estimate with a significant p-value.
Nicotine dependence is negatively associated with introductory age to cigarettes after controlling for introductory age to marijuana and illegal drugs. (Negative Relationship)
AND
Nicotine dependence is negatively associated with introductory age to marijuana after controlling for introductory age to cigarettes and illegal drugs. (Negative Relationship)
AND
Nicotine dependence is positively associated with introductory age to illegal drugs after controlling for introductory age to cigarettes and marijuana. (Positive Relationship)
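Written as an equation, the three statements above describe one multiple regression model. The notation is introduced here only for illustration and does not come from the SAS output: y is nicotine dependence, and x_cig, x_mj and x_drug stand for the introductory ages to cigarettes, marijuana and illegal drugs.

```latex
\[
y_i = \beta_0 + \beta_1 x_{\mathrm{cig},i} + \beta_2 x_{\mathrm{mj},i}
      + \beta_3 x_{\mathrm{drug},i} + \varepsilon_i,
\qquad \beta_1 < 0,\quad \beta_2 < 0,\quad \beta_3 > 0 .
\]
```

Each coefficient is the expected change in y for a one-year increase in that introductory age while the other two are held fixed, which is what "after controlling for" means in the statements above.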
Studying the confidence intervals of the explanatory variables that are associated with the response variable (nicotine dependence):
Each one-year increase in introductory age to cigarettes is associated with an average change of -1.192 in nicotine dependence, holding the other two variables constant. The 95% confidence interval (the CLPARM default) runs from -1.896 to -0.4886: the population parameter plausibly lies in this range, and a different sample from the population would give a somewhat different estimate.
Each one-year increase in introductory age to marijuana is associated with an average change of -1.248 in nicotine dependence, holding the other two variables constant. The 95% confidence interval runs from -1.979 to -0.516.
Each one-year increase in introductory age to illegal drugs is associated with an average change of 1.247 in nicotine dependence, holding the other two variables constant. The 95% confidence interval runs from 0.237 to 2.256.
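As a quick consistency check on the numbers above, a confidence interval for a regression coefficient has the usual symmetric form shown below, so its midpoint should reproduce the parameter estimate. The symbols (the critical value t* and the standard error SE) are generic notation, not values read from the SAS output.

```latex
\[
\hat\beta_j \;\pm\; t^{*}\,\mathrm{SE}(\hat\beta_j)
\]
\[
\begin{aligned}
\text{cigarettes:}    &\quad \tfrac{1}{2}\bigl(-1.896 + (-0.4886)\bigr) = -1.1923 \approx -1.192 \\
\text{marijuana:}     &\quad \tfrac{1}{2}\bigl(-1.979 + (-0.516)\bigr) = -1.2475 \approx -1.248 \\
\text{illegal drugs:} &\quad \tfrac{1}{2}\bigl(0.237 + 2.256\bigr) = 1.2465 \approx 1.247
\end{aligned}
\]
```

The midpoints agree with the reported estimates up to rounding. None of the three intervals contains zero, which is consistent with the significant p-values reported earlier.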
Evaluating the fit of the model for a curvilinear association:
The quadratic term for introductory age to cigarettes is not statistically significant, so there is no evidence that the association is curvilinear.
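The test behind this statement can be written out as follows. Here x is the centered introductory age to cigarettes; the symbols are illustrative notation rather than SAS output.

```latex
\[
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^{2} + \varepsilon_i,
\qquad H_0:\ \beta_2 = 0 \quad \text{vs.} \quad H_1:\ \beta_2 \neq 0 .
\]
```

A non-significant estimate of the coefficient on the squared term means the null hypothesis is not rejected, so the simpler linear term is kept.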
Q-Q Plot
I am using the Q-Q plot to evaluate the assumption that the residuals from the regression model are normally distributed. In the Q-Q plot the points deviate from the reference line at the higher and lower quantiles. This indicates that the model's estimated residuals do not follow a perfect normal distribution. It also suggests that the association observed in the scatter plot may not be fully captured by the linear terms in the model.
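A Q-Q plot of the residuals can also be drawn directly from the saved residuals. The sketch below assumes the RESULTS data set and the RES variable that the OUTPUT statement in the code section creates. It uses the QQPLOT statement of PROC UNIVARIATE as one possible approach; the plot discussed above comes from the PROC GLM diagnostics.

```sas
/* Sketch: normal Q-Q plot of the raw residuals saved by PROC GLM.         */
/* Assumes data set RESULTS with variable RES (see the OUTPUT statement).  */
proc univariate data=results noprint;
    /* mu=est sigma=est draws the reference line from the sample estimates */
    qqplot res / normal(mu=est sigma=est);
run;
```

Points that bend away from the reference line in both tails correspond to the deviations at the higher and lower quantiles described above.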
Standardized Residual
If the residuals follow a standard normal distribution, about 95% of the standardized residuals should fall within 2 standard deviations of the mean. There are 5 extreme outliers that fall above 3 standard deviations, so these residuals are greater than 2.5 in absolute value. I take this as enough evidence that the level of error within the model is unacceptable. In other words, this model is a fairly poor fit to the observed data.
This hints that I should include more explanatory variables to explain the variability and improve the fit of this model.
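The share of residuals beyond each cutoff can be counted rather than read off the plot. This is a minimal sketch that assumes the RESULTS data set and the STDRES variable from the code section; the data set name RESID_FLAGS and the flag variables are hypothetical names introduced for the example.

```sas
/* Sketch: flag standardized residuals beyond 2, 2.5 and 3 in absolute value. */
/* RESID_FLAGS, over2, over2_5 and over3 are hypothetical names.              */
data resid_flags;
    set results;
    if not missing(stdres);              /* keep only rows used in the model  */
    over2   = (abs(stdres) > 2);         /* 1 if beyond 2, else 0             */
    over2_5 = (abs(stdres) > 2.5);
    over3   = (abs(stdres) > 3);
run;

/* SUM gives the count and MEAN the proportion of flagged observations */
proc means data=resid_flags n sum mean;
    var over2 over2_5 over3;
run;
```

Under normality the MEAN column for over2 would be close to 0.05, which gives a concrete number to compare against.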
The Leverage Plot:
Using the leverage plot, I can identify observations that have an unusually large impact on the estimation of the predicted value of the response variable (nicotine dependence), as well as observations that are outliers.
The graph shows how much influence each observation has on the predicted scores. In this case the outliers (red dots) have leverage values close to zero, which means they do not have a strong influence on the estimation of the regression parameters. The circled observation in the graph has high leverage, but it is not an outlier.
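The quantities behind a leverage plot can also be saved and plotted by hand. The sketch below refits the three-variable model from the code section and asks the PROC GLM OUTPUT statement for the leverage (H=) and Cook's distance (COOKD=) alongside the studentized residual. The data set name INFLUENCE and the variable names LEVERAGE and COOKSD are hypothetical.

```sas
/* Sketch: save leverage and Cook's D for each observation.                */
/* INFLUENCE, leverage and cooksd are hypothetical names for this example. */
proc glm data=new2;
    model H1TO7 = IntroAgeCigarette IntroAgeMarijuana IntroAgeIllegalDrugs / solution;
    output out=influence student=stdres h=leverage cookd=cooksd;
run;
quit;

/* Outliers lie outside the +/-2 reference lines; high-leverage points lie far right */
proc sgplot data=influence;
    scatter x=leverage y=stdres;
    refline -2 2 / axis=y;
    xaxis label="Leverage";
    yaxis label="Studentized residual";
run;
```

An observation is a concern mainly when it is both far to the right and outside the reference lines, which matches the reading above: the outliers sit near zero leverage, and the high-leverage point is not an outlier.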
/****************************************************************/
/* CODE */
/* Student Name: Rajeev.S Assignment 03 File: c3_assign3-v06.sas Date: 2016-01-25 */
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
/* Adolescent Health data set */
DATA new;
    set mydata.addhealth_pds;

    LABEL H1GI20 = "Adolescent Age"
          H1TO2  = "Introductory Age to Cigarette"        /* 1 to 20 */
          H1TO11 = "Introductory Age to Chewing Tobacco"  /* 1 to 18 */
          H1TO14 = "Introductory Age to Alcohol"          /* 1 to 19 */
          H1TO30 = "Introductory Age to Marijuana"        /* 1 to 18 */
          H1TO34 = "Introductory Age to Cocaine"          /* 1 to 18 */
          H1TO37 = "Introductory Age to Inhalant"         /* 1 to 18 */
          H1TO40 = "Introductory Age to Illegal Drugs"    /* 1 to 18 */
          H1TO32 = "Number of times Marijuana smoked per month"
          H1TO36 = "Number of times Cocaine used in a month"
          H1TO39 = "Number of times Inhalants used in a month"
          H1TO42 = "Number of times Illegal drugs used in a month"
          H1TO7  = "Number of Cigarettes smoked in a day (within past 30 days)"
          H1TO16 = "Number of Drinks per Occasion";

    /* Introductory age: set 0 and out-of-range codes to missing.
       The original assigned to H1T011, H1T014, H1T030 and H1T034 (digit zero
       instead of the letter O), so those recodes never took effect; fixed here.
       Results and means obtained with the original code may change.
       The comparisons with '.' were dropped: a missing value stays missing. */
    IF H1TO2  = 0 OR H1TO2  > 20 THEN H1TO2  = .;
    IF H1TO11 = 0 OR H1TO11 > 18 THEN H1TO11 = .;
    IF H1TO14 > 19               THEN H1TO14 = .;
    IF H1TO30 = 0 OR H1TO30 > 18 THEN H1TO30 = .;
    IF H1TO34 = 0 OR H1TO34 > 18 THEN H1TO34 = .;
    IF H1TO37 = 0 OR H1TO37 > 18 THEN H1TO37 = .;
    IF H1TO40 = 0 OR H1TO40 > 18 THEN H1TO40 = .;

    /* Usage quantities: set 0 and out-of-range codes to missing */
    IF H1TO7  = 0 OR H1TO7  > 95  THEN H1TO7  = .;  /* cigarettes per day */
    IF H1TO16 >= 90               THEN H1TO16 = .;
    IF H1TO32 = 0 OR H1TO32 > 800 THEN H1TO32 = .;
    IF H1TO36 = 0 OR H1TO36 > 33  THEN H1TO36 = .;
    IF H1TO39 = 0 OR H1TO39 > 789 THEN H1TO39 = .;
    IF H1TO42 = 0 OR H1TO42 > 132 THEN H1TO42 = .;

    /* Adolescent age: 96 to 99 are missing-value codes */
    IF H1GI20 IN (96, 97, 98, 99) THEN H1GI20 = .;
RUN;
* Find the means used to center the introductory-age and usage variables;
PROC MEANS data=new;
    VAR H1TO2 H1TO11 H1TO14 H1TO30 H1TO34 H1TO37 H1TO40
        H1TO7 H1TO16 H1TO32 H1TO36 H1TO39 H1TO42;
RUN;
/* Centering: subtract each variable's mean, copied from the PROC MEANS output.
   Note: 91.3552753 and 64.3797663 lie outside the labelled age ranges, which
   suggests missing-value codes were still present when those means were taken.
   Re-check every constant against PROC MEANS before relying on it. */
data new2;
    set new;
    IntroAgeCigarette          = H1TO2  - 12.8609587;
    IntroAgeTobacco            = H1TO11 - 91.3552753;
    IntroAgeAlcohol            = H1TO14 - 64.3797663;
    IntroAgeMarijuana          = H1TO30 - 5.1107011;
    IntroAgeCocaine            = H1TO34 - 1.9412669;
    IntroAgeInhalant           = H1TO37 - 12.0000000;
    IntroAgeIllegalDrugs       = H1TO40 - 14.1563126;
    NumDrinkPerOccasion        = H1TO16 - 5.3429454;
    NumMarijuanaSmokedMonthly  = H1TO32 - 14.4137515;
    NumCocainUsedMonthly       = H1TO36 - 7.0379747;
    NumInhalantUsedMontly      = H1TO39 - 14.0714286;
    NumIllegalDrugsUsedMonthly = H1TO42 - 9.6756757;
run;
* Multiple regression model with the centered introductory ages for all substances
  (H1TO2 is entered uncentered here, which shifts only the intercept);
PROC GLM data=new2;
    model H1TO7 = H1TO2 IntroAgeAlcohol IntroAgeTobacco IntroAgeMarijuana
                  IntroAgeCocaine IntroAgeInhalant IntroAgeIllegalDrugs
                  NumDrinkPerOccasion / solution;
run;
* The same multiple regression model, with confidence intervals for the parameters (CLPARM);
PROC GLM data=new2;
    model H1TO7 = H1TO2 IntroAgeAlcohol IntroAgeTobacco IntroAgeMarijuana
                  IntroAgeCocaine IntroAgeInhalant IntroAgeIllegalDrugs
                  NumDrinkPerOccasion / solution clparm;
run;
* Scatterplot with linear regression line;
proc sgplot data=new2;
    reg x=H1TO2 y=H1TO7 / lineattrs=(color=blue thickness=2) clm;
    yaxis label="# of Cigarettes Smoked per Day (past 30 days)";
    xaxis label="Introductory Age to Cigarette";
run;
* Scatterplot with linear (blue) and quadratic (green) regression lines;
proc sgplot data=new2;
    reg x=H1TO2 y=H1TO7 / lineattrs=(color=blue thickness=2) degree=1 clm;
    reg x=H1TO2 y=H1TO7 / lineattrs=(color=green thickness=2) degree=2 clm;
    yaxis label="# of Cigarettes Smoked per Day (past 30 days)";
    xaxis label="Introductory Age to Cigarette";
run;
* The linear regression model;
proc glm data=new2;
    model H1TO7 = IntroAgeCigarette / solution clparm;
run;

* The model with the quadratic term added;
proc glm data=new2;
    model H1TO7 = IntroAgeCigarette IntroAgeCigarette*IntroAgeCigarette / solution clparm;
run;
* Evaluating model fit: diagnostic plots, plus residuals saved to RESULTS;
proc glm data=new2 plots(unpack)=all;
    model H1TO7 = IntroAgeCigarette IntroAgeMarijuana IntroAgeIllegalDrugs / solution clparm;
    output out=results residual=res student=stdres;
run;

* Plotting the standardized residuals against adolescent age;
proc gplot data=results;
    label stdres="Standardized residual" H1GI20="Adolescent age";
    plot stdres*H1GI20 / vref=0;
run;
/* END*/
/****************************************************************/
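The centering step in the code above copies each mean by hand from the PROC MEANS output, which goes stale whenever the recodes change. One alternative, sketched here for the three variables in the final model, is to let PROC SQL compute the means: SAS remerges the summary statistic back onto every row. The table name NEW2_SQL is hypothetical, and this is an illustration rather than the method used for the results reported above.

```sas
/* Sketch: center on the sample mean without hard-coded constants.      */
/* NEW2_SQL is a hypothetical table name; NEW is the recoded data set.  */
proc sql;
    create table new2_sql as
    select *,
           H1TO2  - mean(H1TO2)  as IntroAgeCigarette,
           H1TO30 - mean(H1TO30) as IntroAgeMarijuana,
           H1TO40 - mean(H1TO40) as IntroAgeIllegalDrugs
    from new;
quit;

/* Each centered variable should now have a mean of (approximately) zero */
proc means data=new2_sql n mean;
    var IntroAgeCigarette IntroAgeMarijuana IntroAgeIllegalDrugs;
run;
```

Centering changes only the intercept and its interpretation, not the slope estimates, so this mainly guards against typing errors in the constants.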
OUTPUT