Multiple Linear Regression
Multiple linear regression models the relationship between two or more features and a response by fitting a linear equation to observed data. The steps are nearly the same as those for simple linear regression; the difference lies in the evaluation. You can use it to find out which factor has the highest impact on the predicted output and how the different variables relate to each other.
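As a minimal sketch of what "fitting a linear equation" means here, the snippet below estimates an intercept and one coefficient per feature by ordinary least squares using NumPy. The helper name fit_multiple_linear_regression and the data are made up for illustration; the data is generated without noise from y = 1 + 2*x1 + 3*x2, so the fit should recover those values.

```python
import numpy as np

def fit_multiple_linear_regression(X, y):
    """Return (intercept, coefficients) from an ordinary least squares fit.

    Hypothetical helper for illustration; not part of any library.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first fitted weight is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return weights[0], weights[1:]

# Made-up data: two features, response built from y = 1 + 2*x1 + 3*x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

intercept, coefs = fit_multiple_linear_regression(X, y)
print(intercept, coefs)
```

When the features are on comparable scales, the magnitude of each fitted coefficient gives a rough indication of how strongly that feature affects the predicted output.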
ASSUMPTIONS
For a successful regression analysis, it's essential to validate these assumptions:
1. Linearity: The relationship between the dependent and independent variables should be linear.
2. Homoscedasticity: The errors should have constant variance.
3. Multivariate Normality: Multiple regression assumes that the residuals are normally distributed.
4. Lack of Multicollinearity: It is assumed that there is little or no multicollinearity in the data. Multicollinearity occurs when the features (or independent variables) are not independent of each other.
NOTE
Having too many variables can make the model less accurate, especially if certain variables have no effect on the outcome or have a significant effect on other variables. There are various methods for selecting the appropriate variables, such as:
1. Forward Selection
2. Backward Elimination
3. Bidirectional Elimination (also called stepwise selection)
DUMMY VARIABLES
Using categorical data in multiple regression models is a powerful way to include non-numeric data types in a regression model. Categorical data refers to values that represent categories: a fixed number of unordered values, for instance gender (male/female). In a regression model, these values can be represented by dummy variables, which take the value 1 or 0 to indicate the presence or absence of a category.
DUMMY VARIABLE TRAP
The dummy variable trap is a scenario in which two or more dummy variables are highly correlated. In simple terms, one variable can be predicted from the others. Intuitively, there is a duplicate category: if we drop the male category, it is still implicitly defined by the female category (a female value of 0 indicates male, and vice versa). The solution to the dummy variable trap is to drop one of the dummy variables. If there are m categories, use m-1 dummy variables in the model; the category left out can be thought of as the reference value.
Source: Avik Jain
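The m-1 rule for avoiding the dummy variable trap can be sketched in plain Python. The helper dummy_encode and its drop_reference parameter are hypothetical names introduced only for this illustration; the choice of the first category in sorted order as the reference is likewise an assumption.

```python
def dummy_encode(values, drop_reference=True):
    """Encode a categorical column as 0/1 dummy columns.

    With drop_reference=True the first category (in sorted order) is left
    out, so m categories yield m - 1 columns. Hypothetical helper.
    """
    categories = sorted(set(values))
    kept = categories[1:] if drop_reference else categories
    # One row per observation, one 0/1 entry per kept category.
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows

gender = ["male", "female", "female", "male"]

# m = 2 categories, so one dummy column; "female" is the reference value.
columns, rows = dummy_encode(gender)
print(columns)  # ['male']
print(rows)     # [[1], [0], [0], [1]]

# Keeping every category reproduces the trap: each row sums to 1, so the
# dummy columns together duplicate the model's intercept column.
all_columns, all_rows = dummy_encode(gender, drop_reference=False)
print([sum(r) for r in all_rows])  # [1, 1, 1, 1]
```

Libraries such as pandas offer a comparable option when building dummy columns, but the plain-Python version keeps the mechanics visible.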