I am so confused about this: From my understanding, basically a regression model is misspecified if any of the independent variables are correlated with another independent variable. This would result in multicollinearity. The way to correct for this is to drop one of the correlated independent variables.
However, model misspecification occurs when we OMIT an independent variable that is correlated with the other independent variables. Apparently, we should include this omitted variable to correct the misspecification.
Don’t these two things contradict each other? One says we should exclude independent variables that are correlated, and the other says we should add back omitted independent variables that are correlated.
There are different ways in which a model can be misspecified. Multicollinearity does not, in itself, indicate a misspecification. You don’t necessarily need to drop any variables from the model when you have multicollinearity; it really depends on the overall goal of your analysis (what you’re trying to do with the model).
If you omit a variable and it is correlated with the other independent variables, you have misspecified the model and are violating the zero conditional mean assumption on the error (disturbance), E(U|X)=0. OLS will not be your friend in this case.
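If it helps to see the bias concretely, here is a quick simulation sketch (entirely made-up numbers, not from any curriculum example, and a hypothetical `ols` helper): y truly depends on x1 and x2, x2 is correlated with x1, and OLS recovers the slope on x1 only when x2 is included.

```python
# Hypothetical simulation of omitted-variable bias (assumed coefficient values).
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # x2 is correlated with x1
u = rng.normal(size=n)                          # disturbance with E(u|x) = 0
y = 1.0 + 2.0 * x1 + 3.0 * x2 + u               # true model

def ols(y, *cols):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones_like(y), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

print("x2 included:", ols(y, x1, x2))   # roughly [1, 2, 3] -> unbiased
print("x2 omitted :", ols(y, x1))       # slope on x1 drifts toward ~4.4 -> biased
```

With these assumed values, the usual omitted-variable bias formula predicts the short regression’s slope on x1 lands near 2 + 3(0.8) = 4.4 instead of 2, which is what the simulation shows.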
You’ve noticed that this topic isn’t very well addressed in the curriculum (and in many introductory statistics books). Less-than-perfect multicollinearity is not a violation of any regression assumption, and OLS is still unbiased and consistent in this case. If you omit a variable and violate the zero conditional mean assumption, OLS is invalid (biased and inconsistent). You can use OLS in the presence of multicollinearity, but not when you’ve violated the assumption E(U|X)=0. As a practical matter, multicollinearity can be mitigated (collect more data), and it might not be a real problem (again, depending on your specific intentions for the model). In practice, there is a discussion about the tradeoff between reducing collinearity and sacrificing the unbiasedness and consistency of OLS (basically, live with the MC to keep the regression valid).
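To see the “multicollinearity is not a violation” point in the same spirit, here is a rough Monte Carlo sketch (assumed parameter values, hypothetical `simulate` helper) comparing OLS with uncorrelated versus highly correlated regressors: the estimates stay centered on the true slopes either way, but their sampling spread blows up under strong collinearity.

```python
# Rough Monte Carlo sketch: unbiased under multicollinearity, but noisier estimates.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 2_000
true_beta = np.array([1.0, 2.0, 3.0])  # intercept, slope on x1, slope on x2 (assumed)

def simulate(rho):
    """Draw one sample with corr(x1, x2) = rho and return the OLS estimates."""
    cov = [[1.0, rho], [rho, 1.0]]
    x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y = true_beta[0] + x @ true_beta[1:] + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

for rho in (0.0, 0.95):
    est = np.array([simulate(rho) for _ in range(reps)])
    print(f"rho={rho:4.2f}  mean estimates={est.mean(axis=0).round(2)}  "
          f"std of slope on x1={est[:, 1].std():.3f}")
# Both means land near [1, 2, 3]; only the spread of the slope estimates changes.
```

That inflated spread is the practical cost of multicollinearity; unbiasedness and consistency are untouched.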
You are correct that one of the many suggested (and non-essential) fixes for multicollinearity, dropping a variable, can conflict with avoiding omitted variable bias. The latter is a more serious issue. Long story short, don’t violate a key regression assumption (you can’t use the regression if you are in violation).
As a final note, there are some other technicalities that would make it okay or not okay to omit a variable from the regression, even if it’s correlated with the other IVs. For example, an IV that doesn’t truly influence the DV doesn’t need to be in the regression (and won’t cause problems), even if it is correlated with other IVs.
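A quick check of that last point, again with made-up numbers: x2 below is correlated with x1 but has a true coefficient of zero, so leaving it out does not bias the slope on x1.

```python
# Sketch: omitting a correlated but truly irrelevant IV causes no bias (assumed values).
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)  # correlated with x1...
y = 1.0 + 2.0 * x1 + rng.normal(size=n)        # ...but its true coefficient on y is 0

X_short = np.column_stack([np.ones(n), x1])    # regression that omits x2
beta, *_ = np.linalg.lstsq(X_short, y, rcond=None)
print(beta)  # roughly [1, 2]: no omitted-variable bias, because x2 doesn't belong in the model
```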
Thanks Tickersu, that was a lot to take in. At least now I know I wasn’t totally missing something. I feel like this should be addressed in the reading; otherwise, candidates could be misled.
Tickersu’s somewhat rigorous explanation was great; maybe this will be more accessible: you bring up a good point. Most multiple regressions do have a bit of (multi)collinearity among the independent variables, since we’re typically looking at financial data that’s closely related. In life, and on the test, you’ll probably be tipped off to significant collinearity when the F-test says the regressors are jointly significant, yet you can’t reject the null hypothesis that each individual slope coefficient equals zero. This happens because collinearity inflates the standard errors of the coefficients, which widens the confidence intervals and leads to more Type II errors on the individual t-tests.

There is kind of a catch-22 here: since each coefficient adds value to the predictive ability of the model, dropping them would be dumb for short-term forecasting. But in a model you want as close as you can get to a one-to-one relationship between each independent variable and the dependent variable. (It would be easy just to use a simple regression for everything!) The problem becomes, say in a few weeks or months when you plug new values into the independent variables, that you may have previously ‘overfit’ the model via the collinearity, and the success of prior predictions was driven more by the relationships among the X’s than by a genuine relationship with Y. If the collinearity down the line is different, your old model that had a low SEE and seemed useful could very well have a high RMSE when you actually forecast with new values plugged into it. In other words, collinearity masks the statistical significance of the individual independent variables, for better or worse.
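For anyone who wants to see that symptom, here is an illustrative sketch with simulated data (assumed values, using statsmodels): the joint F-test comes back highly significant, while the individual t-tests on two nearly collinear regressors typically fail to reject.

```python
# Sketch of the classic symptom: significant F-test, insignificant individual t-tests.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 60

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)     # nearly collinear with x1
y = 1.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

print("F-test p-value :", res.f_pvalue)      # tiny: the regressors jointly matter
print("t-test p-values:", res.pvalues[1:])   # often large: each slope looks 'insignificant'
```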