Multicollinearity

Multicollinearity occurs when two or more independent variables are highly correlated with each other. But what if the independent variables are correlated, though not highly: do the regression and the statistical tests still work? And what is the consequence of multicollinearity: inconsistent and biased coefficient estimates, or biased standard errors of the coefficients?

The estimates of the coefficient values can be unreliable, and the standard errors are too large.

Multicollinearity is an interesting problem, and the CFA program (and the prep providers) don’t really do it justice in the curriculum. “Highly” correlated isn’t really cut and dried. It usually means “correlated enough to give us ‘weird’ results” (which might be at lower levels of association than in another study or data set).

In the presence of multicollinearity:

  • the R-squared, F-test, and predictions are not biased (they work perfectly fine…only caveat might come in with predictions, but that’s definitely beyond the curriculum, and isn’t guaranteed to be a problem)

  • the estimated coefficients are unbiased and consistent (OLS overall is unbiased and consistent)

  • the standard errors _for the coefficients_ (the affected ones) can be affected (inflated by an easily determined amount; see the sketch at the end of this post)

  • the t-tests on affected coefficients can be nonsignificant (due to an inflated standard error for the coefficient)

  • the estimated coefficients may appear with different directions or magnitude than what theory or much prior work would suggest (and they become unstable and can vary greatly with even small changes in the data set-- this is what S2000 means when he says “unreliable”-- it’s not good to interpret the magnitude and direction of the coefficients that are affected)

There are some other issues as well, but this is a good framework and should cover you for the exam.
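
To make the standard-error bullet concrete: the “easily determined amount” is presumably the square root of the variance inflation factor, sqrt(1 / (1 - Rj^2)), where Rj^2 comes from regressing the affected predictor on the other independent variables. Here’s a minimal sketch under that assumption (Python with numpy and statsmodels; the sample size, correlation, and coefficient values are made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, rho = 1_000, 0.95                      # made-up sample size and corr(X1, X2)

# Simulate y = 1 + 2*X1 + 1.5*X2 + e with highly correlated predictors.
X = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
y = 1 + 2 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=2, size=n)
res = sm.OLS(y, sm.add_constant(X)).fit()

# The R^2 from regressing X1 on the other predictor(s) tells you how much SE(b1)
# is inflated relative to an orthogonal design: the factor is sqrt(1 / (1 - R^2_1)).
r2_x1 = sm.OLS(X[:, 0], sm.add_constant(X[:, 1])).fit().rsquared
print("estimated b1, b2:", np.round(res.params[1:], 2))   # typically still near 2 and 1.5
print("SE(b1), SE(b2):  ", np.round(res.bse[1:], 3))
print("SE inflation factor:", round(np.sqrt(1 / (1 - r2_x1)), 2))  # ~ sqrt(1/(1 - rho^2)), about 3.2 here
```

Notice that the slope estimates are still in the right neighborhood (unbiasedness), and the R-squared and F-test in res.summary() look perfectly healthy; it’s only the per-coefficient standard errors that get blown up.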

Another thing to be careful of is the seemingly paradoxical (and classically used) situation of “a high R-squared, significant F-test, but nonsignificant t-tests”. This really isn’t paradoxical at all, because the R-squared and F-test answer much different questions than individual t-tests.

For example, say we have the simplest case of multicollinearity:

E(y) = b0 + b1X1 + b2X2

Suppose the R-squared is high, the F-test is significant, and the t-tests for b1 and b2 are each nonsignificant.

The R-squared says: “Approximately Z% of the sample variation in the DV is explained by the model that uses X1 and X2,” so it speaks to the variation explained by the group of independent variables.

The F-test answers the question: “As a group, are these independent variables statistically significant for predicting the DV?” or “Is at least 1 independent variable out of the K tested useful for predicting the DV?” or “Is at least one of the tested coefficients statistically different from zero?” (These are really the same question.)

The t-test answers the question: “If I keep all the other variables in the model (say, X1), is this one (say, X2) a statistically significant predictor of the DV?” When you have “highly” related independent variables in the model, they might contribute enough “similar” information that the t-test on X2 says “No, if I have X1, I don’t need X2 in the model,” and then you look at the t-test on X1 and it says “No, if I have X2, I don’t need X1 in this model.” Now it should be a little clearer why multicollinearity can cause nonsignificant t-tests on the related variables (and why the F-test and R-squared can give a seemingly different picture).
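
If it helps to see the pattern, here’s a small simulation (my own sketch, not from the curriculum; it assumes numpy and statsmodels, and the sample size, correlation, and coefficients are invented) that will usually reproduce the “high R-squared, significant F-test, nonsignificant t-tests” picture:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, rho = 30, 0.995                        # small sample, near-perfect correlation (made up)

# Both predictors genuinely matter (true slopes of 1), but they carry almost the
# same information, so the model can't tell which one deserves the credit.
X = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
y = 1 + X[:, 0] + X[:, 1] + rng.normal(size=n)
res = sm.OLS(y, sm.add_constant(X)).fit()

print("R-squared:      ", round(res.rsquared, 3))        # typically high
print("F-test p-value: ", round(res.f_pvalue, 6))         # typically tiny (jointly significant)
print("t-test p-values:", np.round(res.pvalues[1:], 3))   # usually both well above 0.05
```

Drop either predictor and refit, and the remaining one usually becomes strongly significant, which is exactly the “if I have X1 I don’t need X2” story above.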

Edit: this is also assuming that all regression assumptions are met (one of which says no perfect collinearity; less-than-perfect collinearity, even extremely high, is “ok”)…

The correlation is an indication of multicollinearity only if you have two independent variables. If you have more than two independent variables, then a high F-stat and a high R-squared combined with low t-stats provide the classic case of multicollinearity. Hope that helps

The pair-wise correlation, despite what the CFAI text tells you, can only tell you that multicollinearity might be a problem (unless the correlation magnitude is very high, say above 0.9, but even then it’s possible that it isn’t a problem, just less likely). Multicollinearity usually refers to a set of issues caused by a “strong” relationship between 2 or more independent variables. So, my point is, you can have a large correlation coefficient (say 0.8), yet have none of the issues associated with multicollinearity (meaning it’s not a problem). On the other hand, you can have a pair-wise correlation of 0.2 and experience some issues of multicollinearity (thus making it a problem).

There’s a more useful way to assess multicollinearity (in conjunction with the other symptoms you might see), I’m just unsure why they don’t use it in the curriculum, because it wouldn’t take much space or effort on their part.
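
My guess, stated as an assumption, is that the unnamed diagnostic is the variance inflation factor (VIF). Here’s a quick sketch of how it catches a problem that the pairwise correlations understate, using made-up data and statsmodels’ variance_inflation_factor:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 1_000

# Five independent predictors plus a sixth that is (nearly) their sum: no single
# pairwise correlation looks scary, but the set of predictors is highly collinear.
Z = rng.normal(size=(n, 5))
x6 = Z.sum(axis=1) + rng.normal(scale=0.3, size=n)
X = np.column_stack([Z, x6])

corr = np.corrcoef(X, rowvar=False)
print("max |pairwise corr|:", round(np.abs(corr - np.eye(6)).max(), 2))   # only about 0.45 here

exog = sm.add_constant(X)   # VIF for each predictor is computed from the full design matrix
vifs = [variance_inflation_factor(exog, i) for i in range(1, exog.shape[1])]
print("VIFs:", np.round(vifs, 1))   # several land far above the common 5-10 rules of thumb
```

The reverse case from the post above also happens: two predictors correlated at 0.8 can still have modest VIFs and stable, significant coefficients, in which case there’s no problem to fix.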

Why are the estimated coefficients unbiased and consistent? If the estimated coefficients may appear with different directions or magnitudes, they must be biased and inconsistent. Right?

Because there is no assumption that says you can’t have less than perfect collinearity. The other assumptions (used to derive the estimates) are met and therefore allow for the properties of unbiasedness and consistency.

Although it seems reasonable at first glance, it’s not true that they are biased or inconsistent in the presence of multicollinearity. I would recommend revisiting the properties of unbiasedness and consistency (it doesn’t mean the “wrong” value can’t show up, and it doesn’t mean that the estimate can’t change). If there is bias, it is demonstrable with a calculation [i.e. E(b - B) = **a**, where **a** represents a nonzero expected difference between the estimator and the true parameter]. The same principle holds for an estimator that is inconsistent (demonstrably so with a calculation).

What’s happening is that it’s hard to parse out the impact of each collinear predictor on Y because they are so similar in terms of the information they contribute (the closer to perfect collinearity, the closer the matrix gets towards becoming singular/noninvertible-- if I recall…when you have perfect collinearity, the matrix is not invertible and the model is not estimable [hence the no perfect collinearity assumption]). (It’s like a classroom filled with a bunch of kids shouting answers to the questions-- you can hear all the answers, but you can’t discern who said which answer.)

Long story short: unbiasedness and consistency are properties that are conferred to OLS by the assumptions. If you’re not violating the assumptions, OLS is unbiased and consistent (and multicollinearity, when less than perfect, does not violate any assumptions). If you want to know more on the mathematics, I would look for a detailed proof of OLS unbiasedness and consistency. When you see each assumption used, you’ll see how it works, and why multicollinearity isn’t a problem for these properties.
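
For anyone who’d rather see unbiasedness than take it on faith, here’s a rough Monte Carlo sketch (mine, with invented settings): simulate many samples with heavily correlated predictors, fit OLS each time, and look at where the estimates land on average.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, rho, reps = 100, 0.95, 2_000           # made-up simulation settings
true_b = np.array([1.0, 2.0, 1.5])         # intercept, b1, b2

estimates = []
for _ in range(reps):
    X = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
    y = true_b[0] + X @ true_b[1:] + rng.normal(size=n)
    estimates.append(sm.OLS(y, sm.add_constant(X)).fit().params)
estimates = np.array(estimates)

# Individual estimates swing around quite a bit (the "unreliable" part),
# but their average sits on the true values: high variance, no bias.
print("mean of estimates:", np.round(estimates.mean(axis=0), 2))   # ~ [1.0, 2.0, 1.5]
print("std of estimates: ", np.round(estimates.std(axis=0), 2))
```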

An estimator is consistent if, as the sample size increases, the estimates converge to the true value of the parameter.

An estimator is unbiased if the expected value of the estimator equals the true parameter.
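
And a tiny consistency check in the same spirit (again my own sketch with made-up numbers): keep the predictors heavily correlated and just grow the sample, and the estimate homes in on its true value.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
rho, true_b1 = 0.95, 2.0                  # made-up correlation and true coefficient on X1

# Consistency: even with collinear predictors, b1_hat settles on 2 as n grows.
for n in (100, 1_000, 10_000, 100_000):
    X = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
    y = 1.0 + true_b1 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=n)
    b1_hat = sm.OLS(y, sm.add_constant(X)).fit().params[1]
    print(f"n = {n:>7,}: b1_hat = {b1_hat:.3f}")
```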

Can I remember this:

If the assumptions are violated, the standard errors of the estimators will be incorrect, but the consistency of the estimators will not be affected.

If the model is misspecified, the estimators will be biased and inconsistent.