FYI-- Multicollinearity and its impact on the F-test, R-squared

tickersu · May 26, 2015, 9:42pm

I started a thread a few weeks ago about an error in a QM subject test. The original thread can be found here: http://www.analystforum.com/forums/cfa-forums/cfa-level-ii-forum/91340506 (Note: the question has been changed by the CFAI as a result of me approaching them, but they have yet to change the error in the explanation for that question. The change should be reflected soon, I would think. Read on for an explanation).

My assertion was that the presence of multicollinearity neither under nor overstates the F-statistic or R-squared, because multicollinearity doesn’t impact model fit and OLS is still unbiased with MC. In plain terms, the F-statistic (and R-squared) is unaffected by multicollinearity.

Didn’t get many hits on the initial thread, but I went ahead and emailed the Institute in April regarding the issue. Long story short (and it’s a long, long story), after many back-and-forth emails with an extremely helpful rep, who relayed the info to the QM question writer(s), the issue was eventually pushed to the curriculum author. The rep just informed me that the curriculum author agrees with the assertion I have made-- multicollinearity does not affect (under or overstate) the F-statistic (and it follows for R-squared as well). In terms of standard errors, it is only the standard errors on the affected coefficients that are “inflated”. The SER (and MSE) is unaffected.

I decided I would give a bump on the topic, since I’m not sure how or if they will make any errata adjustments for the question’s explanation. It’s also not mentioned in the official curriculum (directly). I don’t think this will have too big of an impact on test day, but I figured it would be worth passing on, just in case it comes up exam day.

S666 · May 26, 2015, 10:01pm

If only I knew this topic half as well as you do

Way to go buddy, you just schooled the CFAI lol

Mosstastic · May 26, 2015, 10:03pm

You wanna take my quant portion on test day?

tickersu · May 26, 2015, 10:11pm

Eh, I wouldn’t say that. I mean, I’ve read 10.5 of their books (didn’t have time for AI/Deriv. at LI [but a BS degree in Finance helped on those] and read QM with Schweser at LI).

sharky7 · May 27, 2015, 9:22am

I wish I had mastered any of the topics the way you have quants.

Just out of curiosity, what’s your background / job?

tickersu · May 27, 2015, 12:37pm

sharky7:

“I wish I had mastered any of the topics the way you have quants.”

Practice, practice, practice! Even after my experience, I’ve realized how broad any field of study can be. Anything you study really well can be “zoomed out” to give you a higher view, and you see there is so much more that you don’t know about the topic-- I can’t believe how much people with Ph.D.s have at least become familiar with in their time…

“Just out of curiosity, what’s your background / job?”

Mostly academic for my statistics background-- I tutored statistics at the university level for a few years (included a lot of what is in QM for the CFAI LI & LII curricula). I’ve also done a bit of graduate course work in statistics. So, practice, practice, and practice-- and it doesn’t hurt that I enjoy learning about statistics.

Highland · May 27, 2015, 1:00pm

Take away - MC is not affected by F-stat and R-squared! it’s been memorized, come exam day lol

I hated stats in Uni with a passion! but i’m seriously starting to love it now. I actually plan on spending my summer taking online stat courses from Coursera.com and MIT Open House…free courses!!!

Highland · May 27, 2015, 1:03pm

Correction - I meant to say MIT Opencourseware*

tickersu · May 27, 2015, 1:45pm

Highland:

“Take away - MC is not affected by F-stat and R-squared! it’s been memorized, come exam day lol”

Almost-- just reverse the statement to “F-stat and R-squared [and by deduction, SER] aren’t affected by MC” or make it say “MC does not affect F-stat or R-squared”-- since I want the take away to be clear.

“I hated stats in Uni with a passion! but i’m seriously starting to love it now.”

Once I got to college, I began to realize that math/statistics/physics/chemistry/etc. was everywhere and that I liked all of them. Never understood the kids who said “I’ll never use this in the real world”… it is everywhere.

“I actually plan on spending my summer taking online stat courses from Coursera.com and MIT Open House…free courses!!!”

I looked into those as well (I think they have some more time series topics), but I haven’t found the time to get into it yet. It’s interesting to see different presentations on the same material. Overall, I think if you can add more to your repertoire and if you enjoy what you’re learning, there isn’t much downside.

JDP · May 27, 2015, 2:51pm

This is definitely very helpful in many ways, but I think the real value (to me at least) is that it indicates this concept, “multicollinearity does not affect (under or overstate) the F-statistic (and it follows for R-squared as well),” might not be tested on the exam. It’s more likely than not that there will be only 6 quant problems, so I don’t think they’d want to create any controversies. Just a thought.

hei.so · May 27, 2015, 3:06pm

Wait… so can you give a summary please?

Book says MC is present when R^2 and the F-test are significant but the t-tests are not.

What should it be then? Just that the standard errors are large and the t-tests are not significant?

tickersu · May 27, 2015, 5:53pm

I did think of this as a possibility as well, but I wouldn’t rely on it (just in case).

tickersu · May 27, 2015, 7:28pm

I would suggest grabbing a plate of food (or a drink) before reading on. Really (too short) summary: IF you have a significant F-test, a relatively high R-squared, but nonsignificant t-tests (need not be all of them when you have many variables, but that’s a stronger indication), MC is quite possibly an issue-- investigate it. The standard errors on the coefficients can become “large”, making t-tests lack statistical significance.

So, I do think it is difficult for the curriculum to fully cover this topic (and many other topics), since we already are responsible for a large amount of information (imagine a full textbook just for QM…).

Multicollinearity isn’t a black and white issue as much as it is a matter of degree. Typically, though, when people say no MC is present, they mean it’s not a problem. There is no formal “test” for MC, but there are several indicators that can suggest it has become an issue (the pattern above is one; others are Variance Inflation Factors, checking the signs on the estimated coefficients, and more). So the book is correct in that statement: a significant F-test and a high R-squared alongside nonsignificant t-tests does imply that MC is an issue.

Here is why:

If R-squared (preferably adjusted R-squared) is relatively high, it indicates that the group of independent variables has explained a relatively high proportion of the total sample variation in the dependent variable (DV). These independent variables are doing a “decent” job at predicting the DV. Here, we don’t actually know if they are statistically significant, though (we need a hypothesis test).
Moving to the F-test for total model utility, we should recall that this test has a null hypothesis that none of the predictor variables is statistically useful for predicting the DV (as a group, they’re not “good”). Ho: B1=B2=…=Bn=0

–The alternative hypothesis is that at least one of these predictor variables is useful for predicting the DV (we don’t know which one or how many [yet], but we know that something in the model is statistically useful). Ha: at least one Bj not equal to zero (as a group they are “good”).

–So, if we have a significant F-test, that means we have evidence of Ha, at least one independent variable is useful for predicting the DV (as a whole group, these variables are useful).
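For anyone who wants to see why the F-test and R-squared travel together, here is the standard textbook identity, written in my own notation (an assumption, not the curriculum's): n is the number of observations and k is the number of slope coefficients, so k plays the role of the "n" in B1…Bn above.

```latex
F \;=\; \frac{\mathrm{SSR}/k}{\mathrm{SSE}/(n-k-1)}
  \;=\; \frac{R^{2}/k}{\left(1-R^{2}\right)/(n-k-1)},
\qquad
R^{2} \;=\; 1-\frac{\mathrm{SSE}}{\mathrm{SST}} .
```

Here SSR = SST - SSE is the explained sum of squares. For a given n and k, F depends on the data only through R-squared, so anything that leaves the overall fit alone leaves the F-statistic alone too.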

Recap at this point: R-squared (no measure of reliability) and the F-test (with a measure of reliability) both indicate the variables we have are somewhat useful for predicting the DV. MC won’t “make” either of these statistics larger or more significant.
Now, if we move on to the t-tests, we would expect a similar story-- right? However, if we look at the t-test and see that none of the variables are significant on their own, then we can say MC is probably an issue (since we have conflicting results of the F-test and t-tests).
The conclusion of these points is that we can say MC is likely a problem if we have all of these things together-- a (statistically) useful model, as denoted by the F-test; the R-squared says the independent variables are useful for explaining variation in the DV, yet the t-tests are giving us the exact opposite picture.

As hard as it is to believe, the t-test and the F-tests are both correct-- but it’s because they’re answering different questions. Think of the simplest case: X1 and X2 used to predict Y.

–The F-test answers the following question: are these variables (X1 and X2) useful, as a group? If we reject Ho, then yes, they are useful as a group.

–The t-test on X1’s coefficient answers a slightly different question: After accounting for X2, is X1 statistically useful for predicting Y? Well, if X1 is (very) similar to X2 (highly collinear), then we would probably say X1 isn’t useful when we have X2-- they give us the same information about Y!

–Then, we look at the t-test for X2: After accounting for X1, is X2 statistically significant? Again, they’re very similar, so we would probably say X2 isn’t significant if we have X1.

This is how both t-tests would say each variable isn’t significant (individually, after accounting for the others), but the F-test says at least one is significant for predicting Y.
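If you want to poke at the X1/X2 case yourself, here is a minimal sketch in plain Python (standard library only). Everything in it is my own assumption for illustration: the simulated data, the seed, the sample size, and the helper names `solve` and `ols`. It fits Y on both X1 and X2 (where X2 is nearly a copy of X1), then on X1 alone, and prints the F-statistic, R-squared, SER, and the t-stats.

```python
import math
import random

def solve(a, b):
    """Solve a small linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(y, xs):
    """Regress y on an intercept plus the predictor lists in xs."""
    n, k = len(y), len(xs)
    p = k + 1
    rows = [[1.0] + [x[i] for x in xs] for i in range(n)]  # design matrix
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * v for bj, v in zip(beta, r)) for r in rows]
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    mse = sse / (n - p)
    # Diagonal of (X'X)^-1, one unit-vector solve per coefficient.
    diag = [solve(xtx, [1.0 if i == j else 0.0 for i in range(p)])[j]
            for j in range(p)]
    se = [math.sqrt(mse * d) for d in diag]
    return {"r2": 1 - sse / sst, "f": ((sst - sse) / k) / mse,
            "ser": math.sqrt(mse), "se": se,
            "t": [b / s for b, s in zip(beta, se)]}

random.seed(1)  # arbitrary seed, only so the run is repeatable
n = 100
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [a + random.gauss(0, 0.05) for a in x1]  # X2 is almost X1
y = [1 + a + b + random.gauss(0, 1) for a, b in zip(x1, x2)]

both = ols(y, [x1, x2])   # collinear pair in the model
alone = ols(y, [x1])      # X1 by itself

for name, fit in (("X1 and X2", both), ("X1 only", alone)):
    print(name, "| F:", round(fit["f"], 1), "| R2:", round(fit["r2"], 3),
          "| SER:", round(fit["ser"], 3),
          "| slope t-stats:", [round(t, 2) for t in fit["t"][1:]])
```

With a setup like this you would typically expect a large F-statistic and nearly the same R-squared and SER in both fits, while the standard error on X1's coefficient is many times larger once X2 is in the model. Whether the individual t-stats actually fall below the usual cutoffs depends on the draw, so treat it as an illustration rather than a proof.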

If you’re following so far, let’s go a little deeper in the rabbit hole: MC can “inflate” the standard errors on the estimated coefficients (so you see “small” t-stats, each referring to one coefficient), but it will not affect the SER (which is why R-squared and the F-test are OK-- they refer to the whole model). The SER deals with the error term uncertainty (model fit), whereas the standard errors on the coefficients tell us about uncertainty (imprecision) in the estimated coefficients. There is more uncertainty (imprecision) in the estimated coefficients when we have fewer “unique” data points to estimate the coefficients with-- in other words, if X1 and X2 are collinear (in a problematic way they “share” a lot of information to predict Y), it becomes very difficult to truly see the stand-alone effect X1 has on Y and the stand-alone effect X2 has on Y. Hence, we have less precision in our estimated coefficients for X1 and X2 (another way of looking at this puzzle).
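To put a rough number on the “inflation”: the usual textbook expression for a coefficient's standard error is se(bj) = SER / sqrt(SSTj × (1 − Rj²)), where SSTj is the total sample variation in Xj and Rj² is the R-squared from regressing Xj on the other predictors. The factor 1 / (1 − Rj²) is the Variance Inflation Factor, and the SER sits in the formula as a separate term. Below is a small sketch for the two-predictor case, where Rj² is just the squared correlation between X1 and X2. The function names, seed, and simulated data are my own assumptions for illustration.

```python
import math
import random

def pearson_r(a, b):
    """Sample correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

random.seed(0)  # arbitrary seed, only so the run is repeatable
x1 = [random.gauss(0, 1) for _ in range(200)]
x2_similar = [a + random.gauss(0, 0.1) for a in x1]      # nearly a copy of x1
x2_unrelated = [random.gauss(0, 1) for _ in range(200)]  # independent of x1

print("VIF, similar predictors:  ", round(vif_two_predictors(x1, x2_similar), 1))
print("VIF, unrelated predictors:", round(vif_two_predictors(x1, x2_unrelated), 3))
```

The standard error scales with the square root of the VIF, so a VIF near 100 means a standard error roughly 10 times what it would be with uncorrelated predictors. With more than two predictors you would need the full auxiliary regression of Xj on the others rather than a single correlation.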

As I’m sure you’re noticing, this isn’t a small topic. There’s quite a bit to it and always more to explore. Hopefully, this helps (and I know it’s a bit more than the original question).

S666 · May 27, 2015, 7:20pm

Mind…blown…

tickersu · May 27, 2015, 7:26pm

It’s even cooler (and I think easier to see) when you use the math to show it (not really high-level math, either).

caschas · May 27, 2015, 8:09pm

Big up Tickersu!

Nice move on the CFAI, respect

Didn’t read it all, but I surely will… minutes are precious, 10 days to go