Time-Series Misspecification

Consider the following equation:

Y_t = b_0 + b_1 X_{1t} + b_2 Y_{t-1} + ε_t

We assume that the error term is serially correlated, which means ε_t is correlated with ε_{t-1}. Because ε_{t-1} is correlated with Y_{t-1}, ε_t is also correlated with Y_{t-1}.

The book says that, in this case, the lagged dependent variable, Y_{t-1}, will be correlated with the error term, violating the assumption that the independent variables are uncorrelated with the error term (which I understand to be heteroskedasticity). As a result, the estimates of the regression coefficients will be biased and inconsistent.

However, violating the assumption that the independent variables are uncorrelated with the error term causes heteroskedasticity, which in this case means ε_t is correlated with Y_{t-1}. But heteroskedasticity does not affect the consistency of the regression parameter estimators; it only affects standard errors and test statistics.

Why does the book say that violating the assumption that the independent variables are uncorrelated with the error term will cause biased and inconsistent estimates of the regression coefficients?

Really good eye there linking the concepts, but I see that you are confusing something. When we talk about heteroskedasticity, we are assuming the model is correctly specified, so heteroskedasticity itself will just affect the standard errors of the estimates.

Misspecification, on the other hand, is a big problem, because the error term absorbs the independent variables that are actually missing from the model. The missing variables are likely to be correlated with the ones already included, so the errors end up correlated with the independent variables. This problem creates inconsistency, and the only way to fix it is to search for a better mix of independent variables.
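
To make this concrete with the model from the original post, here is a minimal simulation sketch (Python/numpy; all parameter values are invented, not from the curriculum):

```python
# Sketch (made-up numbers): Y_t = b0 + b1*X1_t + b2*Y_{t-1} + e_t,
# where the errors are serially correlated: e_t = rho*e_{t-1} + u_t.
import numpy as np

rng = np.random.default_rng(0)
b0, b1, b2, rho = 1.0, 0.5, 0.5, 0.6
n, n_sims = 2000, 500
b2_hats = []

for _ in range(n_sims):
    x1 = rng.normal(size=n)
    e = np.zeros(n)
    y = np.zeros(n)
    for t in range(1, n):
        e[t] = rho * e[t - 1] + rng.normal()           # AR(1) errors
        y[t] = b0 + b1 * x1[t] + b2 * y[t - 1] + e[t]
    # OLS of y_t on a constant, x1_t, and the lagged dependent variable
    X = np.column_stack([np.ones(n - 1), x1[1:], y[:-1]])
    coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    b2_hats.append(coef[2])

print("true b2 =", b2, "| mean OLS estimate:", round(np.mean(b2_hats), 3))
# The estimate of b2 comes out systematically too high, and increasing n
# does not fix it: e_t is correlated with the regressor y_{t-1}, so OLS
# is biased and inconsistent, exactly as the book says.
```

Heteroskedasticity alone would not do this; it is the correlation between a regressor and the error that destroys consistency.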

Very good answer, thanks Harrogath

If the missing variables are correlated with the included variables, then the errors are correlated with the independent variables. This is the case of heteroskedasticity. But heteroskedasticity will not create inconsistency in the estimated coefficients. Right?

I don’t see how the latter follows from the former.

Please explain your reasoning here.

In general (and in more econometric parlance), the unobservable term (the error) absorbs omitted variables when we leave something out (variables, higher-order terms)… which doesn’t necessarily mean we will have heteroscedasticity.

Say the true model is

E(Y) = b_0 + b_1 X_1 + b_2 X_2

We fit: ŷ = b_0 + b_1 X_1

The error is the difference between the actual and predicted values: E(Y − ŷ) = b_2 X_2

This is actually a common way to check for misspecification in residual analysis: plot the residuals against the X variables. For example, a plot of the residuals against a quantitative X1 should show no pattern. If it shows, say, a quadratic pattern, that would indicate we left out a transformation of X1 (maybe X1 squared). We would add that term, rerun the model, and check the new residuals against X1 again (and should see no pattern).
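
Here is a toy sketch of that check (my own invented data; the correlation with X1 squared is just a numerical stand-in for eyeballing the residual plot):

```python
# Sketch: the true relationship is quadratic, but we fit a straight line,
# then inspect the residuals for a leftover pattern.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(-3, 3, size=500)
y = 2.0 + 1.5 * x1 + 0.8 * x1**2 + rng.normal(size=500)   # true model

# Misspecified fit: y on a constant and x1 only (x1**2 omitted)
X = np.column_stack([np.ones_like(x1), x1])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
print("corr(resid, x1^2) =", round(np.corrcoef(resid, x1**2)[0, 1], 3))  # large

# Add the squared term, refit, and the pattern disappears:
X2 = np.column_stack([np.ones_like(x1), x1, x1**2])
coef2, *_ = np.linalg.lstsq(X2, y, rcond=None)
resid2 = y - X2 @ coef2
print("corr(resid2, x1^2) =", round(np.corrcoef(resid2, x1**2)[0, 1], 3))  # ~0
```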

Heteroscedasticity can occur with or without model misspecification (it is more common in cross-sectional data, whereas serial correlation is more common in time-series data).

This violates the assumption regarding the expected value of the error, but heteroscedasticity refers to the variance of the error; they are different things.

This isn’t true. See above.

It doesn’t always do this, but it can, because it violates the assumption often referred to as exogeneity of the regressors. In other words, E(e|X) = 0 should hold, but it is violated.

I would recommend carefully revisiting each of the regression assumptions and the terminology used throughout the chapter. Pay close attention to the word choice (it’s easy to jumble these ideas).

Hope this helps!

Missing variables become part of the error. So if the missing variables are correlated with the included variables, the errors are correlated with the independent variables.

Right, but remember: this isn’t the same as heteroscedasticity!

I see. Heteroscedasticity means the variance of the residuals is related to the independent variables, not their expected value. If the variance of the residuals depends on an independent variable, it leads to incorrect standard errors. But if the expected value of the residuals is correlated with an independent variable, it leads to biased and inconsistent coefficients. Right?
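
A quick simulation sketch I put together to check this (invented numbers; the error variance below depends on X while E(e|X) = 0 still holds):

```python
# Sketch: pure conditional heteroskedasticity. The slope stays consistent;
# only the naive (homoskedastic) standard error is misleading.
import numpy as np

rng = np.random.default_rng(2)
b0, b1, n, n_sims = 1.0, 2.0, 1000, 2000
slopes = []
for _ in range(n_sims):
    x = rng.uniform(0, 2, size=n)
    e = rng.normal(size=n) * x**2        # variance grows with x, E(e|x) = 0
    y = b0 + b1 * x + e
    X = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    slopes.append(coef[1])

print("true b1 =", b1, "| mean estimate:", round(np.mean(slopes), 3))  # ~2.0
print("empirical sd of the estimates:", round(np.std(slopes), 3))

# Naive OLS standard error from the last sample understates that spread:
resid = y - X @ coef
s2 = resid @ resid / (n - 2)
naive_se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
print("naive OLS standard error:", round(naive_se, 3))
```

The coefficient centers on its true value; only the reported standard error is off, which is why heteroskedasticity calls for robust standard errors rather than new coefficients.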

I believe that your reasoning’s flawed.

For example, if A and B have a correlation of +0.7, and A and C have a correlation of +0.7, then B and C can have a correlation anywhere between −0.02 and +1.00.
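
You can verify that range numerically: a proposed correlation matrix is feasible only if it is positive semidefinite. A small sketch:

```python
# Check which values of corr(B, C) are admissible given
# corr(A, B) = corr(A, C) = 0.7. The lower bound is 2*0.7**2 - 1 = -0.02.
import numpy as np

def is_valid_corr(r_bc, r_ab=0.7, r_ac=0.7):
    m = np.array([[1.0,  r_ab, r_ac],
                  [r_ab, 1.0,  r_bc],
                  [r_ac, r_bc, 1.0 ]])
    return bool(np.all(np.linalg.eigvalsh(m) >= -1e-12))  # PSD test

print(is_valid_corr(-0.02))  # True  (boundary case)
print(is_valid_corr(-0.03))  # False (no such set of variables exists)
print(is_valid_corr(1.00))   # True
```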

Remember the two types of heteroskedasticity: conditional (very bad) and unconditional (not that bad, admissible at least). If you encounter conditional heteroskedasticity, your first move should be to look for a better fit, because your model is not correctly specified at all. Conditional heteroskedasticity is when the errors themselves (not their variance) are correlated with the independent variables.

Conditional heteroscedasticity has to do with the error variance (the assumption is a constant error variance for all settings of the independent variables).

I agree with you, S2000. In fact, the whole concept of Omitted Variable Bias is based on this type of reasoning, and it has been confusing me ever since my first Econometrics class. Allow me to elaborate (using the example from Section 5.2 in Reading 10):

Assume the true model is: Y = b_0 + b_1 X_1 + b_2 X_2 + e

Let’s say Y is my weight, and it is affected by the amount of calories I consume (X_1) and the amount of exercise I do (X_2).

However, we misspecify the model and leave out X_2, so we get:

Y = a_0 + a_1 X_1 + e

Now here is my explanation (1) of the problem: if X_2 actually has explanatory power for my weight Y, then leaving X_2 out will cause the error terms (since they capture everything that is left out) to be correlated with Y.

But here is explanation (2), the one in the CFA book (and every other Econometrics textbook): if X_2 is correlated with X_1, leaving it out will cause the error terms to be correlated with X_1.

Why are we worried about a possible correlation between the Xs? Of course it is possible that if I eat more calories I am more energetic and more likely to exercise, or that I am hungrier after working out. But does that really matter? Isn’t it more interesting to us that both of them have an effect on my weight?

Can someone explain to me why explanation 2 is the correct one?

Never mind my question, it just dawned on me what the problem is (Christmas miracle??). This is from Wiki:

Two conditions are necessary for omitted variable bias to be a problem:

  1. the omitted variable must be a determinant of the dependent variable (i.e., its true regression coefficient is not zero); and

  2. the omitted variable must be correlated with an independent variable specified in the regression (i.e., cov(z, x) is not equal to zero).

If only 1) were the case (but not 2)), we wouldn’t have a problem. In my example above, we would still be able to measure how the amount of calories I consume affects my weight.

But if both 1) and 2) are present (calories and workouts are correlated: when I consume more calories I tend to work out less, because I get sleepy from overeating), what happens is this: I am estimating a regression of weight on calories and leaving out workouts. Now I see that calories have a positive effect on my weight, but this effect also picks up the workout effect, so the estimated parameter will be biased upward.
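
A simulation sketch of exactly that story (all numbers invented): calories and workouts are negatively correlated, workouts reduce weight, and the short regression overstates the calorie effect.

```python
# Sketch: omitted variable bias. X1 = calories, X2 = workouts.
import numpy as np

rng = np.random.default_rng(3)
b0, b1, b2 = 60.0, 0.01, -2.0                     # true coefficients
n = 100_000
x1 = rng.normal(2500, 300, size=n)                # calories
x2 = 10 - 0.002 * x1 + rng.normal(size=n)         # workouts: cov(x1, x2) < 0
y = b0 + b1 * x1 + b2 * x2 + rng.normal(size=n)   # true model

# Short regression: weight on calories only, x2 omitted
X = np.column_stack([np.ones(n), x1])
a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classic OVB formula: plim a1 = b1 + b2 * cov(x1, x2) / var(x1)
predicted_a1 = b1 + b2 * np.cov(x1, x2)[0, 1] / np.var(x1, ddof=1)
print("true b1      =", b1)
print("estimated a1 =", round(a_hat[1], 4))       # > b1: biased upward
print("OVB formula  =", round(predicted_a1, 4))
```

Both conditions from the Wiki quote are doing work here: b_2 is nonzero and cov(X_1, X_2) is nonzero, so the bias term b_2 · cov(X_1, X_2) / var(X_1) is positive and the calorie coefficient is pushed upward.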