# Excel Regression

I’m running a regression in Excel and there is a large amount of collinearity I’m getting between my variables. Could someone please provide the quickest way to help combat this in Excel, or would I need a more advanced program? Thanks

you have to reduce the number of variables how many variables are you using? how many datapoints?

not an expert on this sort of stuff, but have you tried eliminating a variable? I doubt fancier software will solve the problem for you… (don’t tell the level II grader’s I can’t spout off the answer…)

The problem is that I’m only using two independent variables, which are also highly correlated to one another. If I only use Variable A, I’m underestimating the strength of the positive relationship. As a result, I also had to include Variable B, which is also strongly positively correlated. The problem is, when I use both variables, it overestimates the strength of the relationship. It’s something I definetely need to get better with over time, but I just wanted a quick answer for the time being. Thanks for the advice.

Sorry: Double Post

Question: are variables A and B positively or negatively correlated? Also: when you run the regression, are the signs of the coefficients for A and B opposite (assuming A&B positively correlated). With multicollinearity, there’s not much you can do other than a) eliminate a variable, b) find a common factor for each variable (and eliminate one or both), or c) expand your data sample. Fancier software won’t really solve this problem. It’s just that there’s not enough information in your dataset to tease out the separate effects of your variables. However, multicollinearity doesn’t bias your estimates, it’s just that your standard errors are larger than they would be if you didn’t have multicollinearity.

bchadwick: Thanks for the response. I’m a little busy right now, but when I get the time I’ll respond. Thanks agian

bchadwick Wrote: ------------------------------------------------------- > > However, multicollinearity doesn’t bias your > estimates, it’s just that your standard errors are > larger than they would be if you didn’t have > multicollinearity. ^ The key. Do you really care that you have multicollinearity? Economists trying to explain the universe care about multicollinearity but people trying to predict gold prices don’t.

Just remove one of the explanatory variables and u shld be fine

But that’s a bad approach. Read bchadwick’s post again. If the two explain more than one, you should use them both (probably).

If all you care about is the prediction, then if both variables explain more, just leave them in and go ahead. As Joey said. You should check if the adjusted R^2 goes down when you remove one of the two variables, just to make sure. You might need to be careful if you care about interpreting “explanatory factors” from regression results with multicollinearity. For example, if variable A and variable B are highly positively correlated, sometimes you get situations where the coefficient of variable A is significant and very positive and the coefficient of variable B is significant and very negative. In those situations, the signs of these coefficients can be highly sensitive to sampling error, in that small changes in your data set (removing a case or two, or adding one) lead to each of the coefficients changing signs suddenly (variable A suddenly has a negative coefficient and variable B is positive). That obviously causes problems if it’s important for you to say that A has such and such an effect on Y and B has this other independent effect on Y. The interpretation of that kinds of result is usually just that both variables are relevant, but it is too difficult to say which one dominates. Then you’d remove one variable to figure out how each one would act independently to get some feeling for general direction and magnitude of each, but you’d have both variables in there for making your predictions. And obviously, you’d want some theoretical reason that both A and B ought to be in the regression in the first place.

mlpguy, from what you said, you are using only 2 explanatory variables. After some thought about the issue I am suspecting that besides contemporaneous serial correlation your model could also be a biased one, because of missing explanatory variables. In the absence of other information, your situation looks very similar to what I experienced when I was doing a project last semester. I used 3 variables, and after some discussion with the professor in charge he said that on top of contemporaneous serial correlation, my model suffered from the omitted-variable bias. Mlpguy, did u also do a hypothesis test to see if a. each variable is significant in explaning the variation in y. If 1 isnt, that u can certainly remove it from the model b. these 2 variables are jointly significant in explanaing variation in y >>“But that’s a bad approach. Read bchadwick’s post again. If the two explain more than one, you should use them both (probably).” --> Not too sure about this, but say, after using the 2 explanatory variables, the adj. R2 is 0.999, and you want to do a forecasting, this will be a very very bad model to do a forecasting with an overfitted model. And in the absence of other information (since he did not mention anything abt. doing a significance test) I guess we probably cant tell exactly whether or not he shld drop the variable…my 0.02 after giving it a bit more thought. Btw I am curious, how do u test for serial correlation using excel? I usually use eviews and I read the DW stat from the output…if u want i could send u a cracked version of it.

Say what? I didn’t see any mention of serial correlation anywhere. Why would we think that is a problem? Why would you think there is “omitted-variable bias”? We don’t even know what the project is about. Multicollinearity is not a sign of model bias, just that lots of information is shared among our independent variables. If the adj R^2 is 0.999 your “overfitted” model is very likely to be an excellent forecasting model. You are missing the point of bchadwicks post (also the content of LII on multicollinearity). If you have multicollinearity in your model, your model is not good at explanation, i.e., you can’t tell much about the relationship between any individual IV and the DV. However, multicollinearity doesn’t affect the quality of your forecasts. To the extent that a new variable adds information, your forecasts should get better (this is not to say that adding a pile of variables makes things better). Significance tests on individual variables are not helpful in multicollinearity because the standard errors of all the estimators are inflated by the multicollinearity. BTW - Both bchadwick and I have been statistics professors.

ahh. shucks i made a terrible mistake by confusing serial correl. with multicollinearity. what was i thinking abt.sorry abt the terrible msitake… the thing about omitted variable bias was not intended to be lumped together with serial correlation/multicollinearity…was just trying to point out that it could be one of the problems that his model has unless he is working on some school project and the model was specified… Joey, I am wondering why the overfitted model makes a good forecast. My crude understanding is that the more variables there are the variance explodes and the RMSE increases? Can’t seem to reconcile the stmt that the “overfitted” mmodel is likely to be an excellent forecasting model with what I learnt in class…would be great if you could enlighten me on that.

Darn I could not edit my post. >> "Significance tests on individual variables are not helpful in multicollinearity because the standard errors of all the estimators are inflated by the multicollinearity. " Perfectly understood cos I understand that the OLS estimator is still BLUE with multicollinearity but the S.E. of the OLS estimators are screwed. But if the correct model was Y=aX1+bX2+ epsilon, and if there is collinearity, the t test should still be valid if it was done separately on each of the variables X1 and X2? i.e. do the regression Y=aX1+u, and get the p-value of X1, and do a second regression Y=bX2+w, and get the p-value of X2? I thought that’s the way to get about doing it?

Hydrogen Rainbow is a great name. It makes me want to eat granola and hug random people.