# LOS 8.i Multiple Regression and ANOVA

I think I understand the coefficients and the R2. It’s basically “the percentage of this Y value that comes from our predictive formula vs error”, correct?

I’m having trouble understanding what the highlighted metric is. The book shows the formula for finding Ra2, but I’m not sure of the strategic reason why a CFA would want to know this. In English, what is this supposed to mean?


startuppivot wrote:
I think I understand the coefficients and the R2. It’s basically “the percentage of this Y value that comes from our predictive formula vs error”, correct?

Not exactly.

It’s the percentage of the variance of Y that is explained by the variance of the Xs.

startuppivot wrote:
I’m having trouble understanding what the highlighted metric is. The book shows the formula for finding Ra2, but I’m not sure of the strategic reason why a CFA would want to know this. In English, what is this supposed to mean?

For a single regression, the adjusted R2 is not much use.  (But let’s see what tickersu has to say about it.)  When comparing two regressions, one with an extra explanatory variable, comparing the adjusted R2 values tells you whether or not to add that extra variable; if the adjusted R2 decreases when the variable is added, don’t add it.
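To make that comparison rule concrete, here is a minimal numeric sketch (the R2 values and the sample size are made up purely for illustration). The adjustment is R2adj = 1 − (1 − R2)(n − 1)/(n − k − 1), with n observations and k slope coefficients:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2: penalizes R^2 for the number of slope coefficients k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 30

# Model A: one explanatory variable, R^2 = 60%
adj_a = adjusted_r2(0.60, n, 1)   # ~0.586

# Model B: add a junk variable; R^2 can only go up, here barely (60.2%)
adj_b = adjusted_r2(0.602, n, 2)  # ~0.573

# Adjusted R^2 fell, so by the rule above, don't add the extra variable.
print(adj_a > adj_b)
```

Adding the second variable barely moved R2 but lowered the adjusted R2, so by the rule above you would leave it out.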

Simplify the complicated side; don't complify the simplicated side.

Financial Exam Help 123: The place to get help for the CFA® exams
http://financialexamhelp123.com/

Yes, the variance is what I meant: you said it better than me so thanks again, S2000

The next section of the reading (8.3 to 8.4) kind of explains it so I get the gist now. Thanks!

My pleasure.


S2000magician wrote:

For a single regression, the adjusted R2 is not much use.  (But let’s see what tickersu has to say about it.)

Adjusted R2 is calculated in a way that penalizes superfluous terms in the model. A simple regression doesn’t need the same adjustment, whereas a multiple regression might, because the unadjusted R2 can be pushed toward 1 by adding more terms to the model, irrespective of their statistical utility.

Sorry, was a little fuzzy, long morning.

tickersu wrote:

S2000magician wrote:

For a single regression, the adjusted R2 is not much use.  (But let’s see what tickersu has to say about it.)

Adjusted R2 is calculated in a way that penalizes superfluous terms in the model. A simple regression doesn’t need the same adjustment, whereas a multiple regression might, because the unadjusted R2 can be pushed toward 1 by adding more terms to the model, irrespective of their statistical utility.

Sorry, was a little fuzzy, long morning.

But, as I understand it, you identify the superfluous terms by comparing two models: one with the term and one without it.

If you look at the adjusted R2 for only one of those models, you can’t really infer anything from it.  Is that accurate?


S2000magician wrote:

startuppivot wrote:
I think I understand the coefficients and the R2. It’s basically “the percentage of this Y value that comes from our predictive formula vs error”, correct?

Not exactly.

It’s the percentage of the variance of Y that is explained by the variance of the Xs.

startuppivot wrote:
I’m having trouble understanding what the highlighted metric is. The book shows the formula for finding Ra2, but I’m not sure of the strategic reason why a CFA would want to know this. In English, what is this supposed to mean?

For a single regression, the adjusted R2 is not much use.  (But let’s see what tickersu has to say about it.)  When comparing two regressions, one with an extra explanatory variable, comparing the adjusted R2 values tells you whether or not to add that extra variable; if the adjusted R2 decreases when the variable is added, don’t add it.

+1

S2000magician wrote:

But, as I understand it, you identify the superfluous terms by comparing two models: one with the term and one without it.

If you look at the adjusted R2 for only one of those models, you can’t really infer anything from it.  Is that accurate?

Not necessarily that you need to compare R2adj. for more than one model. For example, if I model:

1) Y-hat= b0 + b1X1 + b2X2

I can calculate R2 and R2adj. just for that model and compare the two (raw and adjusted, to see how much they differ). If there is a large discrepancy between R2 and R2adj. for the same model it could signal that there’s a (relative) ton of crap in the model (i.e. the penalty is large for these terms), but this isn’t directly comparing a model with and without the terms, it’s just seeing how R2 holds up when we adjust it for the number of terms relative to the sample size.

This is different than comparing number 1) above with:

2) Y-hat= b0 + b1X1 + b2X2+ b3X3 + b4X4

You could compare 1 & 2 on the basis of R2adj.; that would actually be comparing a model with and without the extra terms (this gets into model building/selection criteria: some use the change in R2adj., for example, or pick the model with the largest value of it; it’s a big rabbit hole relative to the CFA exam scope, even for me to talk about in a general post. If you want to chat about it for work or teaching or anything, feel free to message me (I’d be happy to talk there), but there are many kinds of selection criteria that can be employed for picking “best” models, and this is just the wee tip of an iceberg).

Of course, Y must be identical in all of these to make these comparisons. It’s bogus to compare Y as income in millions of USD with a model where Y is ln(income in millions USD); apples and oranges. (This can be tried, though, with “pseudo” R-squared measures, where predicted ln(Y) values are exponentiated and used to calculate the non-transformed R-squared from the ln(Y) model for comparison with the non-ln model…different discussion.)

Does that make sense?
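The “pseudo” R-squared idea mentioned in that parenthetical can be sketched as follows. This is an illustrative toy (simulated data, numpy only), not anyone’s production method, and note that naively exponentiating ln(Y) predictions ignores the retransformation (smearing) bias:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1, 10, size=n)
# True relationship is log-linear: ln(y) = 0.5 + 0.3 x + noise
y = np.exp(0.5 + 0.3 * x + rng.normal(scale=0.2, size=n))

Z = np.column_stack([np.ones(n), x])   # design matrix with intercept

def r2_on_y(fitted, y):
    """R^2 computed on the original (untransformed) Y scale."""
    return 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

# Model 1: Y regressed on X directly
b1, *_ = np.linalg.lstsq(Z, y, rcond=None)
r2_direct = r2_on_y(Z @ b1, y)

# Model 2: ln(Y) regressed on X; predictions exponentiated back to Y's scale
b2, *_ = np.linalg.lstsq(Z, np.log(y), rcond=None)
r2_pseudo = r2_on_y(np.exp(Z @ b2), y)

# r2_direct and r2_pseudo are now on the same scale and comparable;
# comparing r2_direct with the R^2 of the ln(Y) model itself would be
# the apples-and-oranges comparison warned about above.
```

Both numbers measure fit on the same Y scale, which is what makes the comparison legitimate.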

tickersu wrote:
S2000magician wrote:
But, as I understand it, you identify the superfluous terms by comparing two models: one with the term and one without it.

If you look at the adjusted R2 for only one of those models, you can’t really infer anything from it.  Is that accurate?

Not necessarily that you need to compare R2adj. for more than one model. For example, if I model:

1) Y-hat= b0 + b1X1 + b2X2

I can calculate R2 and R2adj. just for that model and compare the two (raw and adjusted, to see how much they differ). If there is a large discrepancy between R2 and R2adj. for the same model it could signal that there’s a (relative) ton of crap in the model (i.e. the penalty is large for these terms), but this isn’t directly comparing a model with and without the terms, it’s just seeing how R2 holds up when we adjust it for the number of terms relative to the sample size.

This is different than comparing number 1) above with:

2) Y-hat= b0 + b1X1 + b2X2+ b3X3 + b4X4

You could compare 1 & 2 on the basis of R2adj.; that would actually be comparing a model with and without the extra terms (this gets into model building/selection criteria: some use the change in R2adj., for example, or pick the model with the largest value of it; it’s a big rabbit hole relative to the CFA exam scope, even for me to talk about in a general post. If you want to chat about it for work or teaching or anything, feel free to message me (I’d be happy to talk there), but there are many kinds of selection criteria that can be employed for picking “best” models, and this is just the wee tip of an iceberg).

Of course, Y must be identical in all of these to make these comparisons. It’s bogus to compare Y as income in millions of USD with a model where Y is ln(income in millions USD); apples and oranges. (This can be tried, though, with “pseudo” R-squared measures, where predicted ln(Y) values are exponentiated and used to calculate the non-transformed R-squared from the ln(Y) model for comparison with the non-ln model…different discussion.)

Does that make sense?

Gotcha.

The curriculum’s not particularly clear on that point.

I know: color you surprised.


Remember you are trying to predict Y variable (dependent) using X variable (independent).

The r-squared tells us how well the x variable explains the y variable. The higher the r-squared, the more variation in the y variable it explains. In other words, if you have a high r-squared, your linear equation explains much of the variation in y; if you have a low r-squared, your linear equation has less predictive power for y. Which do you want? A model with low explanatory power or high explanatory power?

oakwood42 wrote:
Remember you are trying to predict Y variable (dependent) using X variable (independent).

The r-squared tells us how well the x variable explains the y variable. The higher the r-squared, the more variation in the y variable it explains. In other words, if you have a high r-squared, your linear equation explains much of the variation in y; if you have a low r-squared, your linear equation has less predictive power for y. Which do you want? A model with low explanatory power or high explanatory power?

You want a model with high explanatory power, but not one in which a bunch of spurious independent variables have been added to increase the R2 artificially.


tickersu wrote:

If there is a large discrepancy between R2 and R2adj. for the same model it could signal that there’s a (relative) ton of crap in the model (i.e. the penalty is large for these terms)…

I don’t believe much of this. Empirically, R2 and R2adj calculated from a parsimonious model won’t differ much, if at all.

The utility of calculating R2adj is that it lets us compare the R2 of different models where the difference is the number of variables they contain. The trade-off between adding a new explanatory variable and (possibly) reducing R2 can only be noticed by comparing the R2adj of the two models. So, S2000 was right about the use of R2adj in his comment above.

You can run a sensitivity analysis using the R2adj formula. Suppose:

SST = 28,280

SSE = 5,060

Data points (n) = 120 (large enough sample)

Number of variables (k) = 3

R2 = 80.91% R2adj = 80.58%

For k = 5

R2 = 80.91% R2adj = 80.24%

For k = 9

R2 = 80.91% R2adj = 79.53%

For k = 50

R2 = 80.91% R2adj = 67.54%

Nobody uses 50 variables in a single model, nor even 9. Most of the time, 3 to 6 variables is considered a sufficient explanatory set.

The figure changes a lot for small samples, of course. So, with a small data set, you should lean toward using 2 or 3 independent variables at most.
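The table above can be reproduced in a few lines. One convention detail worth flagging: those figures match an adjustment that uses residual degrees of freedom n − k, i.e. with k counting the intercept among the estimated parameters; the CFA curriculum writes the same idea as R2adj = 1 − (1 − R2)(n − 1)/(n − k − 1) with k counting only the slope coefficients. The sketch below uses the former convention, and the SSE of 5,400 that the poster later says he actually used:

```python
SST = 28_280
SSE = 5_400         # value actually used (the post shows 5,060, corrected later in the thread)
n = 120

r2 = 1 - SSE / SST  # ~80.91%; constant here because SSE is held fixed for every k

def adj_r2(k: int) -> float:
    """Adjusted R^2 with k counting all estimated parameters (incl. intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k)

for k in (3, 5, 9, 50):
    print(f"k = {k:2d}: R2 = {r2:.2%}, R2adj = {adj_r2(k):.2%}")
# k =  3: R2 = 80.91%, R2adj = 80.58%
# k =  5: R2 = 80.91%, R2adj = 80.24%
# k =  9: R2 = 80.91%, R2adj = 79.53%
# k = 50: R2 = 80.91%, R2adj = 67.54%
```

Note what the sketch (like the original table) does not capture: in a real refit, adding variables would reduce the SSE, so R2 would creep upward rather than stay fixed.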

The souls of all men are immortal, but the souls of the righteous are immortal and divine.
Socrates

oakwood42 wrote:

Remember you are trying to predict Y variable (dependent) using X variable (independent).

The r-squared tells us how well the x variable explains the y variable. The higher the r-squared, the more variation in the y variable it explains. In other words, if you have a high r-squared, your linear equation explains much of the variation in y; if you have a low r-squared, your linear equation has less predictive power for y. Which do you want? A model with low explanatory power or high explanatory power?

I think you missed the train for this conversation, about 40,000 posts back. It is also not necessarily true that high (low) R-squared is good (bad).

S2000magician wrote:

It’s the percentage of the variance of Y that is explained by the variance of the Xs.

For a single regression, the adjusted R2 is not much use.

For one, I would point out it’s the sampling variation in Y, not the variance, since that term has a specific meaning; but that’s a small point.

For the second, I agree that in simple linear regression (one x-variable) there is no utility for the adjusted R-squared. But as I talked about earlier and below, for a single model with multiple predictors, it can have a small amount of utility. There are more sophisticated ways to assess overfitting or similar issues of “junk” terms.

Harrogath wrote:

I don’t believe much of this. Empirically, R2 and R2adj calculated from a parsimonious model won’t differ much, practically nothing.

I don’t need you to believe it. I have seen plenty of cases where the two differ quite a bit in absolute terms, even with a parsimonious model. I have literally graded hundreds of exams by hand and seen students’ projects and analyses with no more than five terms and an intercept in a model. There are plenty of cases where the two values won’t differ much, and plenty of cases where they differ quite a bit. One of the main reasons the adjustment was created was to penalize for overfitting a model to data (now there are many more options for that). The usual R-squared will never decrease with more terms in the model, all else constant; the penalty was developed to avoid optimism in the predictive ability of the model that might arise from including more terms, even junk. Hence, the adjustment to r-squared. The point is also to encourage parsimony unless there is dramatic improvement from the terms, relative to the sample size (so a sample size of 1,000 with 3 terms will have closer values than 3 terms in a sample of 50; again, the penalization is due to the number of terms relative to the sample size; in short, more degrees of freedom, less penalty). Other penalized statistics include Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC).

Harrogath wrote:
The utility of calculating R2adj is that it lets us compare the R2 of different models where the difference is the number of variables they contain.
That is one benefit of calculating the adjusted (penalized) R-squared. However, it is not the only benefit. The big purpose is that we can look at a model’s explanatory power with penalization for junk in finite samples, which is really only relevant for multiple regression cases. Standard textbooks (from real statisticians) don’t even introduce adjusted R-squared until multiple regression. A large difference between the two values for a single model indicates overfitting a model to data (too many terms relative to the sample size and/or fitting noise, essentially). I suggest you read up on these two traditional calculations; nearly every source will tell you that this is what “adjustment” means and why it is done.

Harrogath wrote:
The trade-off between adding a new explanatory variable and (possibly) reducing R2 can only be noticed by comparing the R2adj of the two models.
Not true. This can also be witnessed in the F-statistic or other model-based statistics that are sensitive to the degrees of freedom; so, not only through the adjusted R-squared.

Harrogath wrote:
So, S2000 was right about the use of R2adj in his comment above.
I agreed there is utility for comparing different sets of right-hand-side terms for the same dependent variable.

Harrogath wrote:
You can run a sensitivity analysis using the R2adj formula. Suppose:

SST = 28,280

SSE = 5,060

Data points (n) = 120 (large enough sample)

Number of variables (k) = 3

R2 = 80.91% R2adj = 80.58%

For k = 5

R2 = 80.91% R2adj = 80.24%

For k = 9

R2 = 80.91% R2adj = 79.53%

For k = 50

R2 = 80.91% R2adj = 67.54%

How did you run these simulations? There are good ways and bad ways to do this. Providing a detailed explanation of your simulation is generally good so people can evaluate what you did.

Harrogath wrote:
Nobody uses 50 variables in a single model, nor even 9.
There are plenty of models that use 9+ terms in the model. It heavily depends on the field and the quantity of data available. There is a world beyond your line of sight.

Harrogath wrote:
The figure changes a lot for small samples, of course.

Right, this is part of the penalization for too few degrees of freedom (too many terms relative to sample size). This is clearly why there is utility in using both values to look at a single multiple regression model. If I have a sample size of 20 with 6 predictors and an r-squared of 80%, the adjustment would leave me around 71%, which would indicate that the benefit (reduced error sum of squares) of all the predictors, relative to the sample size, is overstated by the usual r-squared.

I’m going to be good, and avoid a big back and forth on the rest of this, because we tend to disagree a lot. The narrow scope in the CFA/econ realm is far from what’s out there and what’s even common in other fields.

Instead of taking my experiences or text for it, check out some of the literature on it and some of the common regression texts.

I am interested to continue discussing how you did the simulation, though. It looks as if you only increased the number of terms each time?

tickersu wrote:

Harrogath wrote:

I don’t believe much of this. Empirically, R2 and R2adj calculated from a parsimonious model won’t differ much, if at all.

I don’t need you to believe it. I have seen plenty of cases where the two differ quite a bit in absolute terms, even with a parsimonious model. I have literally graded hundreds of exams by hand and seen students’ projects and analyses with no more than five terms and an intercept in a model. There are plenty of cases where the two values won’t differ much, and plenty of cases where they differ quite a bit. One of the main reasons the adjustment was created was to penalize for overfitting a model to data (now there are many more options for that). The usual R-squared will never decrease with more terms in the model, all else constant; the penalty was developed to avoid optimism in the predictive ability of the model that might arise from including more terms, even junk. Hence, the adjustment to r-squared. The point is also to encourage parsimony unless there is dramatic improvement from the terms, relative to the sample size (so a sample size of 1,000 with 3 terms will have closer values than 3 terms in a sample of 50; again, the penalization is due to the number of terms relative to the sample size; in short, more degrees of freedom, less penalty). Other penalized statistics include Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC).

R2 is the relevant measure for explanatory power. On the other hand, R2 adjusted is meant for comparing models with different numbers of independent variables. In a single regression (whatever the number of variables), we look at R2. If I want to compare two or more models’ R2, then I look at R2 adjusted. I see you are misrepresenting the relevance of R2 or thinking that an adjusted measure is superior by definition. Remember that R2 adjusted is derived from R2 and will always be lower than R2, no matter what.

Also, you would never see a big difference between R2 and R2 adjusted in a parsimonious model with a good sample size. If you are talking about models built at the edge of their assumptions, then you may be right. Also, I don’t know what counts for you as a big difference between R2 and R2 adjusted. As you saw in my simulation, changing from 3 to 50 variables drops R2 by about 13 percentage points when adjusted. That difference is bad.

On the other hand, what kind of models are we talking about?

tickersu wrote:

Harrogath wrote:

The utility of calculating R2adj is that it lets us compare the R2 of different models where the difference is the number of variables they contain.

That is one benefit of calculating the adjusted (penalized) R-squared. However, it is not the only benefit. The big purpose is that we can look at a model’s explanatory power with penalization for junk in finite samples, which is really only relevant for multiple regression cases. Standard textbooks (from real statisticians) don’t even introduce adjusted R-squared until multiple regression. A large difference between the two values for a single model indicates overfitting a model to data (too many terms relative to the sample size and/or fitting noise, essentially). I suggest you read up on these two traditional calculations; nearly every source will tell you that this is what “adjustment” means and why it is done.

Again, I don’t know why you assume R2 adjusted is better than R2 because it penalizes for “junk”. This is not true; R2 captures junk through the SSE. Junk is detected when t-statistics are not statistically significant, when the F-statistic is not statistically significant, etc.

If you introduce junk into your model, both R2 and R2 adjusted will be lower. And the penalty for adding one extra variable will be part of the explanation of why R2 fell. This is because R2 adjusted is used to make R2s comparable. It is not an absolute measure; it is a relative one.

tickersu wrote:

Harrogath wrote:

The trade-off between adding a new explanatory variable and (possibly) reducing R2 can only be noticed by comparing the R2adj of the two models.

Not true. This can also be witnessed in the F-statistic or other model-based statistics that are sensitive to the degrees of freedom; so, not only through the adjusted R-squared.

I was talking about the scenario of using R2 alone, not other measures. Otherwise, you would be right.

tickersu wrote:

Harrogath wrote:
So, S2000 was right about the use of R2adj in his comment above.
I agreed there is utility for comparing different sets of right-hand-side terms for the same dependent variable.

Yes, yes. My principal motivation for criticizing your comment is that S2000 was right, but you added some other explanations that could be considered misleading, so I wanted to clarify.

tickersu wrote:

Harrogath wrote:

You can run a sensitivity analysis using R2adj formula. Suppose:

SST = 28,280

SSE = 5,400 (corrected from 5,060)

Data points (n) = 120 (large enough sample)

Number of variables (k) = 3

R2 = 80.91% R2adj = 80.58%

For k = 5

R2 = 80.91% R2adj = 80.24%

For k = 9

R2 = 80.91% R2adj = 79.53%

For k = 50

R2 = 80.91% R2adj = 67.54%

How did you run these simulations? There are good ways and bad ways to do this. Providing a detailed explanation of your simulation is generally good so people can evaluate what you did.

Just apply the below formulas:

I share with you my little model, which you can replicate in Excel. Sorry, it seems I typed SSE = 5,060 in the forum when in fact I was running the model with 5,400. Sorry for that; fixed above.

tickersu wrote:

Harrogath wrote:

Nobody uses 50 variables in a single model, nor even 9.

There are plenty of models that use 9+ terms in the model. It heavily depends on the field and the quantity of data available. There is a world beyond your line of sight.

Well, we are in a finance forum, and I assume you work in finance too. However, I accept that disciplines like medicine and other research fields could use a lot of variables without increasing the variance of the errors in the search for a “good fit.”

At least in economics and finance, the data are not infinite and a parsimonious model will always be preferred. Nine variables for an econ model is a crime. Sorry.

tickersu wrote:

Harrogath wrote:

The figure changes a lot for small samples, of course.

Right, this is part of the penalization for too few degrees of freedom (too many terms relative to sample size). This is clearly why there is utility in using both values to look at a single multiple regression model. If I have a sample size of 20 with 6 predictors and an r-squared of 80%, the adjustment would leave me around 71%, which would indicate that the benefit (reduced error sum of squares) of all the predictors, relative to the sample size, is overstated by the usual r-squared.

There is the problem: if you are talking about a model at the edge of its assumptions, or with violated assumptions, then your comments are correct. Otherwise they are not. A sample size of 30 is the minimum possible.

tickersu wrote:

I’m going to be good, and avoid a big back and forth on the rest of this, because we tend to disagree a lot. The narrow scope in the CFA/econ realm is far from what’s out there and what’s even common in other fields.

It’s not narrow; you are just working in fields different from finance. You are always fighting against the CFA Program because the curriculum does not teach regression in the detail other disciplines require. Sorry, but 99% of financial analysts or investment managers in the entire world will never run a regression in their lives. At most, they will interpret an ANOVA table, if that. CFAI has done its job well enough.

tickersu wrote:

Instead of taking my experiences or text for it, check out some of the literature on it and some of the common regression texts.

Should I consider your selective set of approved books? You may publish the list somewhere.

tickersu wrote:

I am interested to continue discussing how you did the simulation, though. It looks as if you only increased the number of terms each time?

See above. Yes, the only variable controlled was “k”. My intention was to demonstrate that R2 and R2 adjusted are not much different in a parsimonious model with a reasonable sample size.


The way I see this is that R2 increases as more independent variables are added to the model, so a high R2 could be caused by the number of independent variables and not by the explanatory power of those variables on the dependent variable. That’s why adjusted R2 is calculated (adjusted R2 is always going to be less than or equal to R2).

Example: suppose you have a model with two independent variables, where R2 and adjusted R2 are 60% and 57%, respectively. Then you add four more independent variables (now your model has six) and R2 increases to 64%, but adjusted R2 decreases to 55%.

You would prefer the first model.
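Those numbers are internally consistent under the curriculum formula if we assume a sample size of roughly n = 30 (an assumption; the example does not state n):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 with k slope coefficients."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 30  # assumed; not stated in the example above

print(round(adjusted_r2(0.60, n, 2), 2))  # 0.57, the quoted 57%
print(round(adjusted_r2(0.64, n, 6), 2))  # 0.55, the quoted 55%
```

So the example's arithmetic checks out: a 4-point gain in R2 is outweighed by the degrees-of-freedom penalty at this sample size.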

Julio77 wrote:

The way I see this is that R2 increases as more independent variables are added to the model, so a high R2 could be caused by the number of independent variables and not by the explanatory power of those variables on the dependent variable. That’s why adjusted R2 is calculated (adjusted R2 is always going to be less than or equal to R2).

Example: suppose you have a model with two independent variables, where R2 and adjusted R2 are 60% and 57%, respectively. Then you add four more independent variables (now your model has six) and R2 increases to 64%, but adjusted R2 decreases to 55%.

You would prefer the first model.

Exactly. You have a good understanding.

Harrogath wrote:

R2 is the relevant measure for explanatory power. On the other hand, R2 adjusted is meant for comparing models with different numbers of independent variables.

The latter is the penalized version of the former. It penalizes for too complex a model relative to the sample size and relative to the improvement in the reduction of the error variation.

Harrogath wrote:
In a single regression (whatever the number of variables), we look at R2. If I want to compare two or more models’ R2, then I look at R2 adjusted. I see you are misrepresenting the relevance of R2 or thinking that an adjusted measure is superior by definition. Remember that R2 adjusted is derived from R2 and will always be lower than R2, no matter what.
I’m not misleading anyone by introducing the work and guidance of well-trained practicing statisticians. Being outside of your scope doesn’t mean it’s incorrect or misleading.

Harrogath wrote:
Also, you would never see a big difference between R2 and R2 adjusted in a parsimonious model with a good sample size.
No one is arguing that that is the case. The adjustment in R-squared penalizes for a complicated model relative to the sample size (read that as lots of terms with a small sample), so it encourages parsimony.

Harrogath wrote:
If you are talking about models built at the edge of their assumptions, then you may be right. Also, I don’t know what counts for you as a big difference between R2 and R2 adjusted. As you saw in my simulation, changing from 3 to 50 variables drops R2 by about 13 percentage points when adjusted. That difference is bad.
Literally, I have seen 5-20% absolute difference with only a few variables.

Harrogath wrote:
Again, I don’t know why you assume R2 adjusted is better than R2 because it penalizes for “junk”. This is not true; R2 captures junk through the SSE. Junk is detected when t-statistics are not statistically significant, when the F-statistic is not statistically significant, etc.
R-squared doesn’t penalize for junk; R-squared never decreases (and the error variance never increases) as you add more terms to the model, irrespective of their statistical utility. This is blatantly obvious in that the model is fit by minimizing the sum of squared errors, where the errors are y − y-hat and y-hat is b0 + b1X1 + b2X2 + … + bkXk; the more terms in that, the smaller the sum of squared errors will necessarily be. This is straightforward, and it increases R-squared. Junk can increase R-squared, but not necessarily the adjusted R-squared.

Harrogath wrote:
If you introduce junk into your model, both R2 and R2 adjusted will be lower.
This just simply isn’t true. You need to review how R-squared is calculated.

Harrogath wrote:

I was talking about the scenario of using R2 alone, not other measures. Otherwise, you would be right.

You explicitly said that R-squared was the only way to tell.

Harrogath wrote:
Yes, yes. My principal motivation for criticizing your comment is that S2000 was right, but you added some other explanations that could be considered misleading, so I wanted to clarify.
Again, being unfamiliar with the subject doesn’t mean I’m misleading. It means you have some reading to do before continuing to reply based on personal feeling on the subject. I’m speaking strictly from 1) formal education 2) advice and discussion with real statisticians 3) personal experience and observation 4) not from my feeling on the subject.

Harrogath wrote:
Just apply the below formulas:

I share with you my little model, which you can replicate in Excel. Sorry, it seems I typed SSE = 5,060 in the forum when in fact I was running the model with 5,400. Sorry for that; fixed above.

I’m not asking for the formula; that’s not in contention. You have shown us that you didn’t actually conduct a simulation. A good way to do this would be to generate, say, a set of X, Y data (with two true X variables) known to follow some regression model you specify for the simulation. Add some noise to Y so the relationship isn’t deterministic. Then generate a bunch of other, unpaired x-variables with random values (simulating random independent variables that are junk). Calculate r-squared and adjusted r-squared from the multivariable model of Y, X1, and X2, since we know this is the real relationship set in our model. Then, adding each junk x-variable, calculate the new values of r-squared and adjusted r-squared. (Ideally you could run this 5,000 times at a range of sample sizes to see what tends to happen.) This is a simulation study. What you have done is create some numbers that don’t account for fitting a new model with junk terms, reducing the SSE, and calculating new r-squared values (which is actually what would happen).
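For what it’s worth, a single replication of the simulation described above might look like the following; all parameter values here are illustrative assumptions, and numpy is assumed to be available:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50

# True model: two real predictors plus noise
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=n)

def r2_pair(X, y):
    """Fit OLS with an intercept; return (R^2, adjusted R^2)."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    sse = np.sum((y - Z @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - sse / sst
    k = X.shape[1]                                   # slope coefficients only
    return r2, 1 - (1 - r2) * (n - 1) / (n - k - 1)

X = np.column_stack([x1, x2])
path = [r2_pair(X, y)]
for _ in range(10):                # add pure-junk regressors one at a time
    X = np.column_stack([X, rng.normal(size=n)])
    path.append(r2_pair(X, y))

# R^2 never decreases as junk is added; adjusted R^2 is free to fall.
```

Run across many replications and a range of sample sizes, the pattern is that R2 ratchets upward with every junk regressor while the adjusted R2 is free to fall.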

Harrogath wrote:

Well, we are in a finance forum, and we assume you are in finance too. However, I would accept that disciplines like medicine and other research fields could use a lot of variables without increasing the variance of the errors in the search for a “good fit”.

At least in economics and finance, the data are not infinite, and a parsimonious model will always be preferred. Nine variables for an econ model is a crime. Sorry.

There is the problem: if you are talking about a model on the edge of its assumptions, or even with violated assumptions, then your comments are correct. Otherwise they are not. A sample size of 30 is the bare minimum.

You make so many statements that are emotional and based on feelings, it seems. No one is talking about assumptions, so you’re definitely missing the point there. Also, you’re ignoring that with less data, adjusted R-squared might be more relevant than with more observations, even while you point out that limited sample size is a concern. You’re staring at food in front of you while saying you’re hungry!

Harrogath wrote:
Not narrow; you are just working in fields different from finance. You are always fighting against the CFA Program because the curriculum does not teach regression in detail for other disciplines. Sorry, but 99% of financial analysts or investment managers in the entire world will never run a regression in their lives. At most, they will interpret an ANOVA table, if that. CFAI has done its job well enough.
I fight against them because even for finance and econ they do a terrible job; a good econometrics book demonstrates that. They are often flat-out incorrect. Fun fact: every ANOVA table has an underlying regression model, so again, poor job by CFA Institute in not demonstrating the equivalence. A regression equation likely has far more utility than an ANOVA table, but either way, because ANOVA is a special case of regression, these criticisms still hold.
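To make that “every ANOVA table has an underlying regression model” point concrete, here is an illustrative sketch (hypothetical code and data, not from the curriculum): the one-way ANOVA F statistic is identical to the overall F from regressing y on group dummies.

```python
import numpy as np

def anova_oneway_F(groups):
    """Classic one-way ANOVA F statistic from a list of group samples."""
    all_y = np.concatenate(groups)
    grand = all_y.mean()
    k, n = len(groups), len(all_y)
    ssb = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)  # between-group
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)       # within-group
    return (ssb / (k - 1)) / (ssw / (n - k))

def regression_F(groups):
    """Same F via OLS of y on an intercept plus k-1 group dummies."""
    y = np.concatenate(groups)
    n, k = len(y), len(groups)
    X = np.ones((n, 1))                      # intercept; group 0 is baseline
    start = len(groups[0])
    for g in groups[1:]:                     # one dummy per non-baseline group
        d = np.zeros(n)
        d[start:start + len(g)] = 1.0
        X = np.column_stack([X, d])
        start += len(g)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = ((y - X @ beta) ** 2).sum()
    sst = ((y - y.mean()) ** 2).sum()
    ssr = sst - sse                          # regression (explained) sum of squares
    return (ssr / (k - 1)) / (sse / (n - k))

g = [np.array([4.0, 5, 6]), np.array([7.0, 8, 9, 10]), np.array([1.0, 2, 3])]
print(anova_oneway_F(g), regression_F(g))    # the two F statistics match
```

They match because the between-group sum of squares is exactly the regression sum of squares of the dummy model, and the within-group sum of squares is exactly its SSE.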

Harrogath wrote:
Should I consider your selective set of approved books? You may publish the list somewhere.

See above. Yes, the only variable controlled was “k”. My intention was to demonstrate that R2 and adjusted R2 are not much different in a parsimonious model with a reasonable sample size.

My point is to pick a book written by someone with a PhD in stats, rather than the CFA curriculum as your reference text. Almost any would be better than the CFAI book.

Claiming “this is just for finance” doesn’t make it correct when it’s flat out wrong.

I guess I didn’t do a good job of avoiding further discussion. I will ask that before you fire back another post based on feeling on the subject that you actually do some reading on it.

Having an understanding of how a model is fit (minimizing SSE, usually) and how the usual R-squared is calculated will pretty clearly contradict many of your points. This understanding and reading will also allow you to look past your own feeling on the subject.

Your argument is basically that spoons are only used for eating ice cream because cereal isn’t necessarily tastier and because you prefer ice cream to cereal.

Now, I’ll be good and leave it to that.

P.S. If you are genuinely interested in some book suggestions I’ll happily post a few.

tickersu wrote:

Harrogath wrote:

R2 is the relevant measure for explanatory power. On the other hand, adjusted R2 is meant for comparing models with different numbers of independent variables.

The latter is the penalized version of the former. It penalizes for too complex a model relative to the sample size and relative to the improvement in the reduction of the error variation.
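The penalty is visible directly in the formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the sample size and k the number of independent variables. A tiny sketch with made-up numbers:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R^2, same sample size: more variables means a bigger penalty.
print(adjusted_r2(0.80, 30, 2))   # ~0.785
print(adjusted_r2(0.80, 30, 9))   # 0.71
```

With only 2 variables the adjustment barely moves R²; with 9 variables on the same 30 observations it knocks off nine points, which is exactly the “complexity relative to sample size” penalty.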

Harrogath wrote:
In a single regression (whatever the quantity of variables) we look at R2. If I want to compare two or more models’ R2, then I look at adjusted R2. I see you are misrepresenting the relevance of R2, or thinking that an adjusted measure is superior by definition. Remember that adjusted R2 is derived from R2 and will always be lower than R2, no matter what.
I’m not misleading anyone by introducing the work and guidance of well-trained practicing statisticians. Outside of your scope doesn’t mean it’s incorrect or misleading.

Harrogath wrote:
Also, you would never see a big difference between R2 and adjusted R2 in a parsimonious model with a good sample size.
No one is arguing that this is the case. The adjustment in R-squared penalizes for a complicated model relative to the sample size (read: lots of terms with a small sample), so it encourages parsimony.

Harrogath wrote:
If you are talking about models built on the edge of their assumptions, then you may be right. Also, I don’t know what you consider a big difference between R2 and adjusted R2. As you saw in my simulation, changing from 3 to 50 variables, R2 drops 15 percentage points when adjusted. That difference is bad.
Literally, I have seen 5-20% absolute difference with only a few variables.

Harrogath wrote:
Again, I don’t know why you assume adjusted R2 is better than R2 because it penalizes for “junk”. This is not true: R2 captures junk through the SSE. Junk is detected when the t-statistics are not significant, when the F-statistic is not significant, etc.
R-squared doesn’t penalize for junk; R-squared never decreases (and the error sum of squares never increases) as you add more terms to the model, irrespective of their statistical utility. This is blatantly obvious once you note that the model is fit by minimizing the sum of squared errors, Σ(y − ŷ)², where ŷ = b0 + b1X1 + b2X2 + … + bkXk. The more terms in that equation, the smaller the minimized sum of squared errors will necessarily be, and that increases R-squared. Junk can increase R-squared, but not necessarily the adjusted R-squared.

Harrogath wrote:
If you introduce junk into your model, both R2 and R2 adjusted will be lower.
This simply isn’t true. You need to review how R-squared is calculated.

Harrogath wrote:

I was talking about the scenario of R2 only, not other measures. Otherwise, you would be right.

You explicitly said that R-squared was the only way to tell.

I don’t have much time for this, so instead of replying to each of your sentences, I’d rather sell an idea:

In the process of modeling a variable there is uncertainty about things like the appropriate model to use and the best set of independent variables. Good research implies creating a trial-and-error algorithm until we get the best model possible from the statistical point of view, with the model also being economically sound (this perhaps doesn’t apply to other disciplines). While the algorithm is running, R2 is the relevant measure. Adjusted R2 may be an appropriate measure when comparing models with different numbers of variables; however, since we tend to comply with the model assumptions (enough data points, parsimonious building, well-behaved errors, etc.), the difference between adjusted R2 and R2 will never be critical.
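One way such a trial-and-error algorithm could be sketched (illustrative only, with hypothetical data) is greedy forward selection that uses adjusted R² as the stopping criterion:

```python
import numpy as np

def adj_r2_of(X, y):
    """OLS with an intercept; return adjusted R^2."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    sse = ((y - Xd @ beta) ** 2).sum()
    sst = ((y - y.mean()) ** 2).sum()
    r2 = 1 - sse / sst
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def forward_select(X_all, y):
    """Add, one at a time, the candidate variable that most improves
    adjusted R^2; stop as soon as no candidate improves it."""
    chosen, best = [], -np.inf
    while len(chosen) < X_all.shape[1]:
        scores = {j: adj_r2_of(X_all[:, chosen + [j]], y)
                  for j in range(X_all.shape[1]) if j not in chosen}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:
            break                            # no variable helps any more
        chosen.append(j_best)
        best = scores[j_best]
    return chosen, best

rng = np.random.default_rng(7)
n = 60
X_all = rng.normal(size=(n, 5))                  # column 0 is the real driver
y = 2 + 5 * X_all[:, 0] + rng.normal(size=n)     # columns 1-4 are junk
sel, score = forward_select(X_all, y)
print(sel, round(score, 3))                      # column 0 is selected first
```

Economic soundness can’t be automated, of course; in practice each step would also be screened for the sign and significance of the added coefficient.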

In the scenario of exotic models using exotic data or too many proxy variables, a model’s calculated statistics can show tricky differences, as you may have encountered in your practice; however, we shouldn’t generalize from this.