Why doesn’t the text remove the variables whose coefficients are insignificant from the regression equation? Instead, it uses the original equation to estimate the dependent variable.
I don’t understand your question entirely, but I will try to explain something:
Some researchers argue that it is not always a good idea to drop independent variables whose coefficients are statistically insignificant but that still have a sound economic rationale for explaining the dependent variable. This is like “accepting” a looser threshold for that specific “insignificant” variable. It is still up to the researcher to drop or keep the variable, though.
I don’t understand this part of your question. Could you elaborate?
Data are noisy, and a “large” p-value in one data set doesn’t by itself tell you whether that variable is truly part of the real regression function. As Harrogath mentioned, subject matter expertise should nearly always win over an arbitrary p-value threshold. P-values themselves are random variables with a great deal of sample-to-sample variability. People unfamiliar with statistics have a false sense of confidence in trivial things like p-values because there is mathematics used to estimate that quantity (and others). Mathematics is often precise and provable. Statistics and probability are entirely built on using mathematics and logic to estimate how imprecise and uncertain things are, but people often conflate statistics (uncertain) with mathematics (much more certain).
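To make the “p-values are random variables” point concrete, here is a minimal simulation sketch. It is not from the text; the true slope (0.3), sample size (50), and number of simulated data sets are assumptions picked purely for illustration, and the p-value uses a normal approximation to the t distribution to keep it to the Python standard library. The variable genuinely belongs in the model every time, yet the p-value swings widely from one data set to the next.

```python
import random
from statistics import NormalDist, mean

def slope_p_value(x, y):
    """Two-sided p-value for H0: slope = 0 in a simple linear regression.

    Uses a normal approximation to the t distribution (an assumption made
    here to avoid external libraries; reasonable for moderate n).
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = (rss / (n - 2) / sxx) ** 0.5  # standard error of the slope
    z = slope / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_p_values(true_slope=0.3, n=50, n_datasets=2000, seed=0):
    """Draw many data sets from the SAME true model and collect p-values."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_datasets):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [true_slope * xi + rng.gauss(0, 1) for xi in x]
        p_values.append(slope_p_value(x, y))
    return p_values

p_values = simulate_p_values()
share_nonsignificant = sum(p > 0.05 for p in p_values) / len(p_values)

print(f"smallest p-value: {min(p_values):.2g}")
print(f"largest p-value:  {max(p_values):.2g}")
print(f"share of data sets with p > 0.05: {share_nonsignificant:.0%}")
```

In this setup a sizeable share of the data sets gives a “nonsignificant” slope even though the variable is part of the true regression function by construction, so dropping it based on one sample’s p-value would be the wrong call.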
If the text did as you suggest, it would start to walk the line of p-hacking and overfitting a model to the intricacies of a particular data set. This is a common and generally terrible approach to fitting models, yet many non-statisticians use it without a care in the world.
Imagine you drop a plate on the floor and it shatters into big and little pieces; imagine we put the plate back together (miraculously) but could see all of the big and little cracks. Now if I asked you to draw a highly detailed sketch of this plate, you would have some large, general crack patterns and then very tiny intricate patterns.
If we dropped a second plate and fixed it in a similar manner, it may look loosely like the drawing of the first plate. However, all of the tiny intricate patterns will likely be very different, and surely the tiny, detailed crack pattern in your drawing isn’t a very accurate representation of most broken and repaired plates (but the big pattern probably does an okay job).
This is like fine-tuning a model to one data set (i.e. letting the data make all of the decisions for you, which is a bad idea); maybe some general features hold up in the real world on new data, but often you’ve fit a bunch of noise from the original data set, and that noise isn’t a real feature that describes what’s really happening. Then your model performance tanks on new data because you overfit it.
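The plate analogy can be sketched as a small simulation. This is only an illustration under made-up assumptions (30 observations, 40 candidate predictors, all of them pure noise unrelated to y): we let the data pick the “best” predictor by in-sample correlation, then check how that same kind of predictor does on fresh data.

```python
import random
from statistics import mean

def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5

def selection_experiment(n=30, n_candidates=40, n_trials=200, seed=1):
    """Pick the noise predictor that looks best in-sample, then re-test it.

    Every candidate is unrelated to y by construction, so any in-sample
    "signal" is just the tiny crack pattern of that one data set.
    """
    rng = random.Random(seed)
    in_sample, out_of_sample = [], []
    for _ in range(n_trials):
        y = [rng.gauss(0, 1) for _ in range(n)]
        candidates = [[rng.gauss(0, 1) for _ in range(n)]
                      for _ in range(n_candidates)]
        # Data-driven choice: keep whichever predictor correlates most with y.
        in_sample.append(max(abs(correlation(c, y)) for c in candidates))
        # New data set: the chosen predictor is still just noise.
        y_new = [rng.gauss(0, 1) for _ in range(n)]
        x_new = [rng.gauss(0, 1) for _ in range(n)]
        out_of_sample.append(abs(correlation(x_new, y_new)))
    return mean(in_sample), mean(out_of_sample)

mean_in, mean_out = selection_experiment()
print(f"average |correlation| of the selected predictor, original data: {mean_in:.2f}")
print(f"average |correlation| of the selected predictor, new data:      {mean_out:.2f}")
```

The selected predictor looks much stronger on the data used to choose it than on new data, which is the “fitting noise” problem: the selection step itself manufactures an apparent relationship.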
Yes, I believe I have the same question. If a coefficient is found NOT to be significantly different from zero, do we remove the corresponding variable, or keep it because it may help the model overall through how it interacts with the other variables?
See my post above…and the word “interaction”/“interact” has a very specific meaning in statistics.
For the purposes of the CFA exam, do whatever their book says (often misinformed). In real life, a nonsignificant p-value is usually a very poor criterion for removing a variable from a model.
Duncan J Murdoch, Yu-Ling Tsai & James Adcock (2008) P-Values are Random Variables, The American Statistician, 62:3, 242-245
A relevant article worth reading on P-value behaviors.
Thanks for your reply
What I meant in the second part was why the text didn’t drop the insignificant variables and use new regression equations without them. Taking economic considerations into account makes it a lot clearer now, though. Thanks a lot.
Thanks for such a detailed and informative reply.
I just have one more question: Putting aside the economic importance of the variable, if the data set itself is broken or not representative enough of the population, wouldn’t using the estimated coefficients for the variables be inaccurate in the first place?
Can you elaborate more on what you mean by “…if the data set itself is broken or not representative enough of the population…”?
Are you asking what happens, in general, if we have a nonrepresentative sample of the population of interest or are you asking something else?