Quant - Model Misspecification

Example of misspecification of functional form (omitting a variable):

R = b0 + b1*B + b2*lnM + b3*lnPB + b4*FF + error

where R = return, B = beta, lnM = natural log of market cap, lnPB = natural log of P/B, FF = free float. The coefficients on lnM and FF are found to be statistically significant at 1%. If lnM is NOT included in the regression model:

R = a0 + a1*B + a2*lnPB + a3*FF + error

Schweser says: If lnM is correlated with any of the remaining independent variables (B, lnPB, or FF), then the error term is also correlated with those same independent variables, and the resulting regression coefficients (the estimates of a0, a1, a2, a3) are biased and inconsistent. That means our hypothesis tests and predictions using the model are unreliable.

************************************************************

I don’t get this. Can a quant explain?

>> If lnM is correlated with any of the remaining independent variables (B, lnPB, or FF), then the error term is also correlated with those same independent variables, and the resulting regression coefficients are biased and inconsistent.

Mwvt9, take a look at this part in the curriculum book. It’s explained in detail, along with an example. I don’t have the book with me now, else I would have given you the page number. Basically, whenever you omit a relevant variable, the coefficients on the remaining variables will be inconsistent and biased, and the hypothesis testing won’t be reliable. This is because the t-stats will have incorrect values, the slope coefficients will be incorrect, and all that jazz.

>> The error term will be correlated with the independent variables by definition (through the assumptions of the multiple regression equation).

I don’t know exactly how they got this correlation part, but it was mentioned in the textbook too. I couldn’t get a satisfactory derivation from the assumptions, so I’m just taking this part on face value.
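The correlation claim falls out of the algebra: if the true model really contains b2*lnM and you drop lnM, its contribution gets folded into the new error term. If lnM is correlated with B, that new error term is correlated with B, which violates the zero-conditional-mean assumption, and no amount of data fixes it (that is what "inconsistent" means). A quick numpy simulation shows the effect — the coefficients and the correlation strength below are made up for illustration, not from the curriculum example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical true model: R = 1.0 + 0.5*B + 0.3*lnM + e,
# with lnM correlated with B (all numbers are made up).
B = rng.normal(size=n)
lnM = 0.8 * B + rng.normal(size=n)   # lnM correlated with B
e = rng.normal(size=n)
R = 1.0 + 0.5 * B + 0.3 * lnM + e

# Full model: OLS recovers the true coefficient on B (~0.5)
X_full = np.column_stack([np.ones(n), B, lnM])
b_full = np.linalg.lstsq(X_full, R, rcond=None)[0]

# Short model omitting lnM: the term 0.3*lnM is absorbed into the
# error, and since lnM is correlated with B, the coefficient on B
# converges to 0.5 + 0.3*0.8 = 0.74 no matter how large n gets.
X_short = np.column_stack([np.ones(n), B])
b_short = np.linalg.lstsq(X_short, R, rcond=None)[0]

print(round(b_full[1], 2))   # ~0.5  (unbiased)
print(round(b_short[1], 2))  # ~0.74 (biased and inconsistent)
```

The bias is exactly (omitted coefficient) × (regression slope of the omitted variable on the included one), so it only disappears when the omitted variable is uncorrelated with everything you kept.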

The point here is to eliminate the correlated independent variable from the multiple regression model and then re-run the statistical software to generate a better ANOVA. If we have a model which takes 4 independent variables (GPM, NPM, OPM)

i want to say mwvt and dinesh are talking about 2 different things, no? with model misspecification, it’s that you’re omitting something that’s important. if you omit something important, it leads to biased and inconsistent regression coefficients and unreliable hypothesis testing. you’d want to add the important thing you left out to make the regression more accurate?

i want to say dinesh is talking about multicollinearity — where some of the independent variables (or combos of them) are correlated with each other? there, to correct it, you remove 1 or more of the correlated independent variables.

quant is not a good area of mine… correct me where i go off the tracks, but i want to say we’re talking about 2 diff things above. if anyone good at quant wants to talk more about model misspecification, i’m all ears. i have on a notecard also that you get misspecification if the data is improperly pooled (makes sense). i also have that it can be misspecified if explanatory variables are correlated w/ the error term in time series models. (i want to say this was mwvt’s original question and i have no clue about it — hopefully someone can elaborate).

I think the curriculum is so messed up on this stuff I just don’t know where to start. However, I think that there are some serious questions here about biased and consistent estimators for _____. I think you should say something like “misspecified models result in, uh, bad models” and leave it at that.

I am so sorry bannisja and mwvt9. I misread the question in the morning haste and am way off: mwvt9 was asking about model misspecification and I went all hands-on with multicollinearity.

What mwvt9 quotes is called omitted variable bias, which surely causes model misspecification and wrong estimates and inferences. So if you have variables X1 and X2 that were both supposed to be in the model, and we have only considered the dependency of Y on X1 (i.e. completely omitted X2), then if X2 is correlated with X1, all the estimates go BAD.

Another cause of misspecification is incorrect pooling of data from different timelines. If we had a high-inflationary phase from 1990 to 2000 and then a constant inflation growth rate from 2000 to 2008, then we should have 2 models, one taking care of 1990-2000 and the other working on the later half. Combining both into a single model will give us all wrong estimates.

And finally, we should take the economic relation between the dependent and independent variables as what economics tells us to take. If it’s empirically proved that log(sales) and profit margins are related, then we should just use that, and not the raw sales-to-PM relationship.
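The pooling point is easy to see in a quick simulation: fit two regimes that genuinely have different slopes, then pool all the observations into one regression. The pooled slope matches neither regime. The numbers below (slopes of 2.0 and 0.5, the regime means) are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Two made-up regimes with genuinely different slopes
x1 = rng.normal(size=n)                        # e.g. 1990-2000
y1 = 2.0 * x1 + rng.normal(scale=0.1, size=n)  # regime 1: slope 2.0
x2 = rng.normal(loc=3.0, size=n)               # e.g. 2000-2008
y2 = 0.5 * x2 + rng.normal(scale=0.1, size=n)  # regime 2: slope 0.5

def ols_slope(x, y):
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

s1 = ols_slope(x1, y1)                         # ~2.0
s2 = ols_slope(x2, y2)                         # ~0.5
pooled = ols_slope(np.concatenate([x1, x2]),
                   np.concatenate([y1, y2]))   # ~0.7: fits neither regime
print(round(s1, 2), round(s2, 2), round(pooled, 2))
```

One regression per regime gets both slopes right; the pooled regression produces a single slope that is wrong for every observation in the sample.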

dinesh.sundrani Wrote:
-------------------------------------------------------
> So if you have a variable X1 and X2 that was
> supposed to be considered for the model and we
> have just considered dependency of Y on the
> variable X1 (i.e. completely omitted X2), then if
> X2 was correlated to X1, then all the estimates go
> BAD.

Here’s what I don’t understand. If X2 was correlated with X1, don’t we want to omit X2 to prevent multicollinearity? I thought omitting X2 would be a good thing, not a bad thing.

dinesh- you had me until “And finally, we should take the economic relation between the dependent and independent variables as what economics tells us to take. If it’s empirically proved that log(sales) and profit margins are related, then we should just use that, and not the raw sales-to-PM relationship.” if you’re not sick of typing, can you (or anyone) elaborate on this one at all?

I somehow feel that there is a tradeoff between incorporating too many independent variables (causing multicollinearity) vs. selecting too few of them (causing misspecification). We need the right mix to get the model to have a low out-of-sample RMSE. duh… I am bad at Time Series and have not revised it in ages. We really need an expert here.

I think the key in this example is that only two of the independent variables were significant in the first place. We then dropped one of them. I would agree with what dinesh said above, but the more I study this the less I feel I understand.

paging mr. JDV, can you please pick up the white courtesy phone?

dinesh.sundrani Wrote:
-------------------------------------------------------
> I somehow feel that there is a tradeoff between
> incorporating too many indep variables (causing
> multicollinearity) vs selecting too few of them
> (causing misspecification). We need to have the
> right mix to get the model to have a low
> out-of-sample RMSE.
>
> duh… I am bad at Time Series and have not
> revised it in ages. We really need an expert
> here.

Banni - All I wanted to say here is that log(MarketCapitalization) is a better independent variable than MarketCapitalization alone (without the log) when we are quantifying the dependency of company returns on market cap. Why so? Because different companies can have totally different market capitalizations. A small-cap company might have far bigger returns than a Walmart. So to compare apples to apples, we take the log of MarketCapitalization to put all the companies under consideration on common ground.
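The "common ground" point can be sketched numerically: a cross-section of market caps spans several orders of magnitude and is heavily right-skewed, so a linear cap term is dominated by the handful of mega caps; in logs, every firm sits on a comparable scale. The simulated caps below are purely illustrative (a lognormal with made-up parameters, not real data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated market caps: a lognormal gives many small caps and a
# few mega caps, roughly like a real cross-section (parameters
# are made up for illustration).
caps = rng.lognormal(mean=22.0, sigma=2.0, size=10_000)

# Raw caps are heavily right-skewed: the mean sits far above the
# median, so a linear cap term is driven by the largest firms.
print(caps.mean() > 5 * np.median(caps))   # True

# In logs, a doubling in size is the same increment whether the
# firm is worth $100M or $100B, so all firms are comparable.
log_caps = np.log(caps)
print(abs(log_caps.mean() - 22.0) < 0.1)   # True: roughly symmetric
print(abs(log_caps.std() - 2.0) < 0.1)     # True: moderate spread
```

The same reasoning applies to dinesh's log(sales) example: the log transform matches the empirically observed functional form, which is part of specifying the model correctly.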