Why is there an assumption that residuals in linear regression have to be normally distributed? How is this different from residuals being independent of each other and the mean of the residuals equaling zero?
I am confused about why we need these assumptions and how they relate to each other.
A really fine point: the distinction between errors and residuals. The residuals are our estimates of the errors (and the assumptions are about the errors, so we use the residuals to help us see whether those assumptions are reasonably satisfied).
Anyhow, if you really want to know, I would find a detailed walk-through (video or PDF) of the OLS derivation (it is part of the mathematics). The model is derived, and at a certain point the assumptions help you get to the final form. Additionally, these assumptions confer certain properties on the model depending on which of the six or so are satisfied.
For example: the idea is that yi = b0 + b1xi1 + ei for the ith observation. We know we can't get this exactly right for every observation, but we do know a few other things. We assume the errors average out to zero (their expected value is zero; and, once the model is fitted with an intercept, the residuals will in fact sum to zero). So we can look at our equation and realize that we can be right, on average.
Taking the expected value of yi = b0 + b1xi1 + ei, we get E(yi) = b0 + b1xi1 + E(ei) = b0 + b1xi1, since the expected value of the errors is zero. A related fact is that the point of means, (x-bar, y-bar), always falls on the fitted regression line (but that is aside from the point we were making).
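If you want a quick numerical check of those two facts (a sketch with simulated data; the coefficients, seed, and sample size are just made up for illustration), the OLS residuals sum to essentially zero and the fitted line passes through the point of means:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)   # made-up "true" b0 and b1

# closed-form OLS estimates for simple regression
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
print(residuals.sum())                # ~0, up to floating-point noise
print(y.mean(), b0 + b1 * x.mean())   # y-bar equals the fitted value at x-bar
```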
The independence assumption is needed as part of a group of assumptions that allow for unbiased and consistent estimation of the parameters. This is where you might want to have a video walk you through the derivation to see why and where this happens.
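To give a flavor of what that derivation shows (a sketch for simple regression, treating the x values as fixed and using the notation from above): the slope estimate can be rewritten as the true slope plus a weighted sum of the errors, and the zero-mean assumption on the errors makes that extra piece vanish in expectation, which is the unbiasedness part.

$$
\hat b_1 = \frac{\sum_i (x_{i1}-\bar x)(y_i-\bar y)}{\sum_i (x_{i1}-\bar x)^2}
= b_1 + \frac{\sum_i (x_{i1}-\bar x)\, e_i}{\sum_i (x_{i1}-\bar x)^2},
\qquad
E(\hat b_1) = b_1 \ \text{ when } \ E(e_i \mid x) = 0 .
$$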
Finally, normality of the error distribution can be relaxed in larger samples, but if it is satisfied then the OLS estimates are the same as the maximum likelihood estimates (and, if I recall, it is also what justifies the t-tests and F-tests).
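If you want to see why normality makes OLS and maximum likelihood coincide (a sketch, assuming independent normal errors with constant variance sigma^2): write down the normal log-likelihood and note that, for the coefficients, maximizing it is exactly minimizing the OLS sum of squared errors.

$$
\ell(b_0, b_1, \sigma^2) = -\frac{n}{2}\log\!\big(2\pi\sigma^2\big) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(y_i - b_0 - b_1 x_{i1}\big)^2 ,
$$

so whatever values of $(b_0, b_1)$ minimize $\sum_i (y_i - b_0 - b_1 x_{i1})^2$ also maximize the likelihood; the two estimators agree.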
These are most likely beyond the exam, but if you want to know, I would suggest a quick video or guided derivation to see most of these.
We mostly deal with sample data and cannot observe the errors; we can only observe the residuals. Wikipedia gives a very good example:
"…For example, if the mean height in a population of 21-year-old men is 1.75 meters, and one randomly chosen man is 1.80 meters tall, then the “error” is 0.05 meters; if the randomly chosen man is 1.70 meters tall, then the “error” is −0.05 meters. The expected value, being the mean of the entire population, is typically unobservable, and hence the statistical error cannot be observed either.
A residual (or fitting deviation), on the other hand, is an observable estimate of the unobservable statistical error. Consider the previous example with men’s heights and suppose we have a random sample of n people…The difference between the height of each man in the sample and the observable sample mean is a residual."
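Here is a tiny numeric version of that example (a sketch; the heights are invented, and the population mean is treated as known only so we can compute the errors for comparison):

```python
import numpy as np

population_mean = 1.75                       # in practice this is unobservable
sample = np.array([1.80, 1.70, 1.76, 1.73])  # made-up sample of heights (meters)

errors = sample - population_mean    # statistical errors: deviations from the true mean
residuals = sample - sample.mean()   # residuals: deviations from the observable sample mean

print(errors)      # roughly [ 0.05 -0.05  0.01 -0.02]
print(residuals)   # close to, but not the same as, the errors
```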
Now to your question: why do we make the normality assumption? Any regression result is incomplete without checking whether the explanatory variables are significant, along with other tests, and as a result we need to conduct t and F tests. These tests cannot be conducted (exactly) without assuming normality of the errors. Why? Because if the errors are not normally distributed, then the sum of their squares won't follow a chi-square distribution, and without that chi-square distribution we can't construct the t and F statistics.
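To make that a bit more concrete (a sketch for simple regression with normal errors): the slope estimate is then normally distributed, the residual sum of squares divided by the error variance follows a chi-square distribution with n - 2 degrees of freedom, and the two are independent. Those are exactly the ingredients needed for the usual ratio to follow a t distribution:

$$
\frac{\sum_i \hat e_i^{\,2}}{\sigma^2} \sim \chi^2_{n-2},
\qquad
t = \frac{\hat b_1 - b_1}{\widehat{\operatorname{se}}(\hat b_1)} \sim t_{n-2},
$$

and the F statistic is built similarly from ratios of independent chi-square variables.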
I would note that the Wikipedia example is only accurate when the sample mean (x-bar) is itself the model. The (prediction) error is the difference between the real values of Y and the corresponding Y values predicted by the model (and, of course, our estimate of it is the residual, as we noted). So if the sample mean is the model (i.e., we use no independent variables), then the residuals are the differences between the actual Y values and the predicted Y values (which are just the sample mean for every case). Typically, though, we use Y values predicted by a model of the form E(y) = B0 + B1X1 + … + BnXn, which gives predicted values different from the sample mean in most cases. I think the important thing to notice is that their example implicitly treats the sample mean as the whole model, which might be a little confusing to people.
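To make that concrete (a sketch with simulated data; the model form, coefficients, and seed are assumptions just for illustration), residuals from a "mean-only" model, which is the Wikipedia case, are not the same as residuals from a fitted line:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 0.8 * x + rng.normal(0, 1, size=50)

resid_mean_only = y - y.mean()           # the "model" is just the sample mean (Wikipedia's case)

b1, b0 = np.polyfit(x, y, 1)             # slope and intercept of the fitted line
resid_regression = y - (b0 + b1 * x)     # residuals from the model E(y) = b0 + b1*x

print(resid_mean_only.std(), resid_regression.std())  # the fitted line leaves smaller residuals
```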