I am so confused about this: From my understanding, basically a regression model is misspecified if any of the independent variables are correlated with another independent variable. This would result in multicollinearity. The way to correct for this is to drop one of the correlated independent variables.
However, model misspecification occurs when we OMIT an independent variable that is correlated with the other independent variables. Apparently, we should include this omitted variable to correct the misspecification.
Don’t these two things contradict each other? One says we should exclude independent variables that are correlated, and the other says we should add back omitted independent variables that are correlated.
There are different ways in which a model can be misspecified. Multicollinearity does not, in itself, indicate a misspecification. You don’t necessarily need to drop any variables from the model when you have multicollinearity; it really depends on the overall goal of your analysis (what you’re trying to do with the model).
If you omit a variable and it is correlated with the other independent variables, you have misspecified the model and are violating the zero conditional mean assumption on the error (disturbance), E(U|X)=0. OLS will not be your friend in this case.
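If it helps to see the bias concretely, here is a quick simulation sketch (entirely made-up numbers, not from any curriculum example, and a hypothetical `ols` helper): y truly depends on x1 and x2, x2 is correlated with x1, and OLS recovers the slope on x1 only when x2 is included.

```python
# Hypothetical simulation of omitted-variable bias (assumed coefficient values).
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # x2 is correlated with x1
u = rng.normal(size=n)                          # disturbance with E(u|x) = 0
y = 1.0 + 2.0 * x1 + 3.0 * x2 + u               # true model

def ols(y, *cols):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones_like(y), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

print("x2 included:", ols(y, x1, x2))   # roughly [1, 2, 3] -> unbiased
print("x2 omitted :", ols(y, x1))       # slope on x1 drifts toward ~4.4 -> biased
```

With these assumed values, the usual omitted-variable bias formula predicts the short regression’s slope on x1 lands near 2 + 3(0.8) = 4.4 instead of 2, which is what the simulation shows.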
You’ve noticed that this topic isn’t very well addressed in the curriculum (and in many introductory statistics books). Less-than-perfect multicollinearity is not a violation of any regression assumption, and OLS is still unbiased and consistent in this case. If you omit a variable and violate the zero conditional mean assumption, OLS is invalid (biased and inconsistent). You can use OLS in the presence of multicollinearity, but not when you’ve violated the assumption E(U|X)=0. As a practical matter, multicollinearity can be mitigated (collect more data), and it might not be a real problem (again, depending on your specific intentions for the model). In practice, there is a discussion about the tradeoff between reducing collinearity and sacrificing the unbiasedness and consistency of OLS (basically, live with the MC to keep the regression valid).
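To see the “multicollinearity is not a violation” point in the same spirit, here is a rough Monte Carlo sketch (assumed parameter values, hypothetical `simulate` helper) comparing OLS with uncorrelated versus highly correlated regressors: the estimates stay centered on the true slopes either way, but their sampling spread blows up under strong collinearity.

```python
# Rough Monte Carlo sketch: unbiased under multicollinearity, but noisier estimates.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 2_000
true_beta = np.array([1.0, 2.0, 3.0])  # intercept, slope on x1, slope on x2 (assumed)

def simulate(rho):
    """Draw one sample with corr(x1, x2) = rho and return the OLS estimates."""
    cov = [[1.0, rho], [rho, 1.0]]
    x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y = true_beta[0] + x @ true_beta[1:] + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

for rho in (0.0, 0.95):
    est = np.array([simulate(rho) for _ in range(reps)])
    print(f"rho={rho:4.2f}  mean estimates={est.mean(axis=0).round(2)}  "
          f"std of slope on x1={est[:, 1].std():.3f}")
# Both means land near [1, 2, 3]; only the spread of the slope estimates changes.
```

That inflated spread is the practical cost of multicollinearity; unbiasedness and consistency are untouched.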
You are correct that one of the many suggested (and non-essential) fixes for multicollinearity, dropping a variable, can conflict with avoiding omitted variable bias. The latter is a more serious issue. Long story short, don’t violate a key regression assumption (you can’t use the regression if you are in violation).
As a final note, there are some other technicalities that would make it okay or not okay to omit a variable from the regression, even if it’s correlated with the other IVs. For example, an IV that doesn’t truly influence the DV doesn’t need to be in the regression (and won’t cause problems), even if it is correlated with other IVs.
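A quick check of that last point, again with made-up numbers: x2 below is correlated with x1 but has a true coefficient of zero, so leaving it out does not bias the slope on x1.

```python
# Sketch: omitting a correlated but truly irrelevant IV causes no bias (assumed values).
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)  # correlated with x1...
y = 1.0 + 2.0 * x1 + rng.normal(size=n)        # ...but its true coefficient on y is 0

X_short = np.column_stack([np.ones(n), x1])    # regression that omits x2
beta, *_ = np.linalg.lstsq(X_short, y, rcond=None)
print(beta)  # roughly [1, 2]: no omitted-variable bias, because x2 doesn't belong in the model
```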
Thanks Tickersu, that was a lot to take in. At least now I know I wasn’t totally missing something. I feel like this should be addressed in the reading; otherwise, candidates could be misled.
Tickersu’s somewhat rigorous explanation was great; maybe this will be more accessible: you bring up a good point. Most multiple regressions do have a bit of (multi)collinearity among the independent variables, since we’re typically looking at financial data that’s closely related. In life, and on the test, you’ll probably be tipped off to significant collinearity when the F-test says the regressors are jointly significant, yet you can’t reject the null hypothesis that each individual slope coefficient equals zero. This happens because collinearity inflates the standard errors of the coefficients, which widens the confidence intervals and leads to more Type II errors on the individual t-tests.

There is kind of a catch-22 here: since each coefficient adds value to the predictive ability of the model, dropping them would be dumb for short-term forecasting. But in a model you want as close as you can get to a one-to-one relationship between each independent variable and the dependent variable. (It would be easy just to use a simple regression for everything!) The problem becomes, say in a few weeks or months when you plug new values into the independent variables, that you may have previously ‘overfit’ the model via the collinearity, and the success of prior predictions was driven more by the relationships among the X’s than by a genuine relationship with Y. If the collinearity down the line is different, your old model that had a low SEE and seemed useful could very well have a high RMSE when you actually forecast with new values plugged into it. In other words, collinearity masks the statistical significance of the individual independent variables, for better or worse.
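For anyone who wants to see that symptom, here is an illustrative sketch with simulated data (assumed values, using statsmodels): the joint F-test comes back highly significant, while the individual t-tests on two nearly collinear regressors typically fail to reject.

```python
# Sketch of the classic symptom: significant F-test, insignificant individual t-tests.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 60

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)     # nearly collinear with x1
y = 1.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

print("F-test p-value :", res.f_pvalue)      # tiny: the regressors jointly matter
print("t-test p-values:", res.pvalues[1:])   # often large: each slope looks 'insignificant'
```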