Re: Coefficient of determination (R^2)

Shams · May 23, 2014, 11:03am

Hey guys, I am wondering if you can give me some pointers on correctly interpreting R^2. Here is a hypothetical example. Suppose a company regresses its market share growth or decline on the price premium or discount it has against its competitor. A scatter plot with the price gap on the x axis and market share on the y axis shows an inverse relationship, where a narrowing of the price gap translates to higher market share. A linear regression comes up with a coefficient of determination of 0.74, which means that 74% of the variation in this company’s market share can be explained by the price gap. The other 26% is unexplained and is related to other factors.

Let’s say we know that the only factor other than the price gap is a themed marketing event that the company runs, such as an NHL event (spend $30 and submit your rebate for a free cap or scarf). Let’s also say that between two periods the market share increased from 25% to 26%. Can we say that 0.74 percentage points of this 1-point increase are attributable to the price gap and 0.26 points to the NHL event? I am not sure if I can “gratuitously” use the R^2 percentage in this manner. In other words, say the 1-point increase corresponds to 100 incremental units sold. Can I then say that 26% of 100, or 26 units, is attributable to the NHL event?

What do you think? I don’t think so, but I was hoping you could give me a “yes” or “no” answer. Thanks.

Robotech · May 23, 2014, 2:06pm

See below

bchad · May 23, 2014, 2:02pm

It doesn’t really work like that. R (not R^2) is basically the correlation coefficient between your model’s predictions and the observed values. One implication, BTW, is that the out-of-sample R is likely to be lower than the in-sample R, because the in-sample R is probably overfitted relative to the true R (which is unobservable until it’s no longer relevant).
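A minimal sketch of that relationship, using made-up numbers (the `price_gap` and `share` values are hypothetical, not from the thread). For a least-squares line with an intercept, the correlation between predictions and observations, squared, should match R^2 computed from the sums of squares. This only covers the in-sample case.

```python
from math import sqrt

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    var_u = sum((ui - mu) ** 2 for ui in u)
    var_v = sum((vi - mv) ** 2 for vi in v)
    return cov / sqrt(var_u * var_v)

# Hypothetical data: price gap (x) and market share in percent (y).
price_gap = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
share = [27.1, 26.2, 26.5, 25.1, 25.4, 24.3]

a, b = fit_line(price_gap, share)
predicted = [a + b * x for x in price_gap]

# R: correlation between the model's predictions and the observed values.
r = pearson(predicted, share)

# R^2 from the sums of squares, for comparison with r squared.
mean_share = sum(share) / len(share)
ss_res = sum((y - p) ** 2 for y, p in zip(share, predicted))
ss_tot = sum((y - mean_share) ** 2 for y in share)
r_squared = 1 - ss_res / ss_tot

print(round(r, 4), round(r ** 2, 4), round(r_squared, 4))
```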

To estimate the effect of a change in price, you don’t use R, you use beta (the regression slope). So if you see a 1% change in market share, you compute, via your beta, the predicted change in market share given however much you changed the price differential. So beta*(change in price differential) is how much market share came from your price change.

[1% - beta*(change in price differential)] is your estimate of how much came from other sources. That’s called “the residual.” If you know the only other possible source is the marketing campaign, you can go ahead and assert that the campaign is responsible for this difference. You seldom actually know that, because there is almost always noise from other factors. If you assume the noise is random, but has stable qualities of randomness (homoskedastic, mean of zero, or at least a constant mean), you can often estimate how much of the remaining change in market share comes from specific factors vs noise.
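To make the arithmetic concrete, here is a small sketch of the beta-based split. The function name `attribute_change` and all the numbers are assumptions for illustration only (a beta of -0.8 share points per unit of price gap, and a price gap narrowed by 0.5); nothing here is estimated from real data.

```python
def attribute_change(observed_change, beta, change_in_price_gap):
    """Split an observed change in market share into the part predicted
    by the price-gap regression and the residual left for other sources."""
    price_effect = beta * change_in_price_gap
    residual = observed_change - price_effect
    return price_effect, residual

# Hypothetical inputs: share rose by 1 point, beta is -0.8 share points
# per unit of price gap, and the price gap narrowed by 0.5 units.
price_effect, residual = attribute_change(
    observed_change=1.0, beta=-0.8, change_in_price_gap=-0.5
)

print(f"from price change: {price_effect:.2f} points")
print(f"residual (other sources and noise): {residual:.2f} points")
```

Note that R^2 never enters the calculation: the split depends on beta and on how much the price gap moved, so it will generally not come out as 74/26.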

R^2 is really about how much “noise reduction” your model helps you achieve. Basically it compares the variance (i.e. (SD)^2) of the residuals in your model (the unexplained bits) to the variance of your dependent variable with no model at all (or, more accurately, a model where you use the dependent variable’s mean to predict all cases). If your model is any good, then the errors you get by using the variables in the model are systematically smaller than the errors you would get by guessing the dependent variable’s mean all the time. R^2 tells you how much smaller: it is 1 minus the ratio of those two variances, i.e. of (standard deviations)^2.
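The comparison described above can be sketched in a few lines. The data are hypothetical, and the helper names `fit_line` and `mean_squared_error` are made up for this example; the idea is just to put the model's squared errors next to the squared errors from always guessing the mean.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def mean_squared_error(observed, predicted):
    return mean([(y - p) ** 2 for y, p in zip(observed, predicted)])

# Hypothetical data: price gap (x) and market share in percent (y).
price_gap = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
share = [27.1, 26.2, 26.5, 25.1, 25.4, 24.3]

intercept, slope = fit_line(price_gap, share)
model_predictions = [intercept + slope * x for x in price_gap]

# "No model at all": predict the mean of the dependent variable every time.
baseline_predictions = [mean(share)] * len(share)

model_mse = mean_squared_error(share, model_predictions)
baseline_mse = mean_squared_error(share, baseline_predictions)

# R^2 is one minus the ratio of residual variance to baseline variance.
r_squared = 1 - model_mse / baseline_mse

print(f"baseline error variance: {baseline_mse:.4f}")
print(f"model error variance:    {model_mse:.4f}")
print(f"R^2:                     {r_squared:.4f}")
```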

Mobius_Strip · May 23, 2014, 7:59pm

Long story short, R^2 is a global measure of goodness of fit. In a linear regression, you minimize the sum of squared residuals across all data points in your set (SSres), which also maximizes your R^2 = 1 - SSres/SStotal. This tells you nothing about the local goodness of fit, in particular how small the residual is for a specific data point in your set. Refer to R^2 when you try to describe the data set as a whole, but not individual data points.
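A rough sketch of the global vs. local distinction, on made-up data where the fifth point sits off the trend (the values and helper names are assumptions for illustration). The overall R^2 comes out high, yet the individual residuals differ a lot in size, and nudging the slope away from the least-squares value only increases SSres.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def sum_squared_residuals(x, y, intercept, slope):
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data; the fifth point is deliberately off the trend.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 12.0, 12.1]

intercept, slope = fit_line(x, y)

# Global measure: one number for the whole data set.
mean_y = sum(y) / len(y)
ss_res = sum_squared_residuals(x, y, intercept, slope)
ss_total = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_total

# Local view: one residual per data point.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
worst_index = max(range(len(residuals)), key=lambda i: abs(residuals[i]))

# Moving the slope away from the least-squares value increases SSres.
ss_res_steeper = sum_squared_residuals(x, y, intercept, slope + 0.1)
ss_res_flatter = sum_squared_residuals(x, y, intercept, slope - 0.1)

print(f"R^2 = {r_squared:.3f}")
for xi, res in zip(x, residuals):
    print(f"x = {xi}: residual = {res:+.3f}")
```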