Linear Regression Assumptions

I have questions about the rationale for these three linear regression assumptions:

  1. Independent variable is uncorrelated with the residuals

  2. Variance of the residual term is constant for all observations

  3. Residual term is independently distributed (residuals have zero correlation with one another).

I am confused about why these three assumptions are required for linear regression models.

The intention of a regression is to explain the behavior of one variable using other variables, which lets us predict movements in the variable we want to explain. We want to decompose the dependent variable into parts, each of which is an observable variable. However, it is impossible to decompose the dependent variable exactly into the others, so an error term is unavoidable. Thus, to be able to say the regression is good (a good fit), the error term must have certain characteristics:

  1. The independent variables must not be correlated with the error term, because we want to separate the model into two parts: a deterministic one and a random one. The independent variables are meant to be the determined part, and the error the random part.
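To see why this matters, here is a minimal simulation (my own sketch, not part of the original answer) in which an omitted variable makes the regressor correlated with the error term, biasing the OLS slope away from its true value:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# An omitted variable w drives both the regressor x and the error u,
# so x is correlated with u (the exogeneity assumption fails).
w = rng.standard_normal(n)
x = w + rng.standard_normal(n)
u = 0.8 * w + rng.standard_normal(n)   # Cov(x, u) = 0.8
y = 2.0 + 3.0 * x + u                  # true slope is 3

# OLS slope: Cov(x, y) / Var(x)
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print(slope)  # converges to 3 + Cov(x, u)/Var(x) = 3 + 0.8/2 = 3.4, not 3
```

The slope estimate absorbs the part of the error that moves with the regressor, so no amount of data fixes the bias; this is why the deterministic and random parts must be kept uncorrelated.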

  2. The variance of the errors (residuals) must be constant; if it is not, the standard errors of the estimated parameters will be unreliable, even though the coefficient estimates themselves remain unbiased. This is the problem of heteroskedasticity, discussed further below.

  3. The residuals must be uncorrelated with one another, because that indicates they are random. As stated in point 1, the errors are meant to be the random part of the model: the portion of the dependent variable that cannot be explained by any other variable because it is inherently stochastic (random).
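As an illustration (my own sketch, not from the original answer), correlated residuals are easy to detect by checking the lag-1 correlation of the fitted residuals. Here the errors follow an AR(1) process, violating the assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

x = rng.standard_normal(n)

# AR(1) errors: each error carries over 80% of the previous one,
# violating the "uncorrelated residuals" assumption.
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + rng.standard_normal()

y = 1.0 + 2.0 * x + e

# Fit OLS with an intercept and recover the residuals.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Lag-1 autocorrelation of residuals: near 0.8 here; it would be
# near 0 if the assumption held.
rho1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
print(rho1)
```

A formal version of this check is the Durbin-Watson test, which is built on exactly this lag-1 residual correlation.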

  4. The expected value of the error term must be zero. This follows from the desired randomness of the errors.

Hope this helps.


To clarify some more:

  1. If the error variance is not constant, the usual standard errors will be biased (and therefore test statistics and confidence intervals will be invalid, irrespective of sample size). However, the slope estimates will still be unbiased and consistent (this requires only Gauss-Markov assumptions 1-4 for OLS). R-squared (and adjusted R-squared) is also unaffected and remains consistent.
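A quick Monte Carlo sketch (my addition, using numpy only) of this first point: with heteroskedastic errors, the classical OLS standard error understates the true sampling variability of the slope, while the slope estimate itself stays centered on the truth:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 2_000
slopes, classic_se = [], []

for _ in range(reps):
    x = rng.standard_normal(n)
    # Error spread grows with |x|: Var(e | x) = x**2 (heteroskedastic).
    e = np.abs(x) * rng.standard_normal(n)
    y = 1.0 + 2.0 * x + e

    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta

    slopes.append(beta[1])
    # Classical (homoskedasticity-based) standard error of the slope.
    sigma2 = resid @ resid / (n - 2)
    sxx = np.sum((x - x.mean()) ** 2)
    classic_se.append(np.sqrt(sigma2 / sxx))

slopes = np.array(slopes)
print(slopes.mean())        # still close to 2: the slope is unbiased
print(slopes.std(ddof=1))   # true sampling spread of the slope
print(np.mean(classic_se))  # classical SE: noticeably too small here
</n```

In this setup the classical SE misses the true spread by a large factor, which is why heteroskedasticity-robust (sandwich) standard errors are used in practice even though the coefficients themselves need no correction.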

  2. If the errors are not independent (for example, serially correlated errors combined with a lagged dependent variable), you can run into issues with the consistency and unbiasedness of your estimates.

  3. Again, a zero-mean error term is needed for unbiased and consistent estimation. If the errors have a non-zero (conditional) mean, revisit the model specification.
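As a small illustration of this last point (my own sketch): fitting a straight line to data generated by a curve leaves residuals whose mean depends on x, the classic signature of misspecification. The overall residual mean is zero by construction when an intercept is included, so it is the conditional mean that reveals the problem:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

x = rng.uniform(-2.0, 2.0, n)
y = x**2 + 0.5 * rng.standard_normal(n)   # true relation is quadratic

# Fit a (misspecified) straight line with an intercept.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Overall residual mean is ~0 by construction, but the conditional
# mean is not: positive at the extremes, negative in the middle.
print(resid.mean())
print(resid[np.abs(x) > 1.5].mean())   # clearly positive
print(resid[np.abs(x) < 0.5].mean())   # clearly negative
```

Binning residuals by the regressor like this (or simply plotting residuals against fitted values) is a standard first diagnostic for a misspecified functional form.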