R Squared

I’m having a hard time wrapping my head around this concept. A high R-squared indicates that the security performs closely in line with an index or passively managed fund; a low R-squared indicates that the security performs nothing like the index, whether better or worse. Can someone elaborate on this? And how are R-squared and correlation related?

If I remember correctly, you take the square root of R-squared to calculate the correlation to the index or whatever you are measuring against (the square root gives you the magnitude; the sign comes from the direction of the relationship). R-squared is a statistical figure that tells you what percentage of the variation in the dependent variable is explained by the independent variable.

R^2 is the percentage of the variation in the dependent variable Y that is explained by the variation in the independent (input) variable X. Correlation (r, rho) is the bounded covariance, r = Cov(a,b) / (SD_a * SD_b), measuring the strength of the linear relationship on a scale from -1 to 1.
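Here’s a quick NumPy sketch of that arithmetic, using made-up return series (the numbers are invented, just to show the calculation):

```python
import numpy as np

# Hypothetical monthly returns for an index (x) and a security (y).
x = np.array([0.02, -0.01, 0.03, 0.015, -0.02, 0.01])
y = np.array([0.018, -0.008, 0.027, 0.012, -0.022, 0.009])

# Correlation as bounded covariance: r = Cov(x, y) / (SD_x * SD_y)
cov_xy = np.cov(x, y)[0, 1]
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(f"r   = {r:.4f}")     # strength/direction of the linear relationship
print(f"r^2 = {r**2:.4f}")  # share of Y's variation explained by X
```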

The index fund is the independent variable and the security (stock) is the dependent variable. If the index moves 10% and the security doesn’t move with it at all (i.e., zero correlation), then it has zero R^2.
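To make that concrete, here is a small simulation (hypothetical returns generated with NumPy, not market data): a security that ignores the index gets an R^2 near zero, while one that tracks the index gets an R^2 near one.

```python
import numpy as np

rng = np.random.default_rng(0)

index = rng.normal(0.0, 0.01, 250)                  # independent variable: index returns
unrelated = rng.normal(0.0, 0.01, 250)              # security that ignores the index entirely
tracker = 0.9 * index + rng.normal(0.0, 0.002, 250) # security that mostly follows the index

for name, stock in [("unrelated stock", unrelated), ("index tracker", tracker)]:
    r = np.corrcoef(index, stock)[0, 1]
    print(f"{name}: r = {r:.3f}, r^2 = {r**2:.3f}")
# unrelated stock: r^2 near 0; index tracker: r^2 near 1
```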

When I was in college, I took a one-on-one class in quant/statistics (just the teacher and me) on comparing buckets of US vs. Asia data for a project I was working on, using linear regression models, SPSS, and the like. After analysis, he finally came to the conclusion that the data had too low an R-squared. At the time I didn’t fully understand what that meant, but he said quite frankly, “oh yeah, all of this analysis is basically useless, as the Asia data did not explain the US data.” Also, we had another professor go to Asia for the six months prior to research all of this data. In the end, the low R-squared meant that we couldn’t really explain the dependent (US) variable from the independent (Asia) variable.

OK, a few things. If you do a linear regression (the usual kind), R (not R^2) is the correlation between your prediction and the observed data. So if you have only one independent (or explanatory) variable, then R = ABS(correlation between dependent and independent variable). So they are closely related.

If the explanatory variable is a market index and R is high (R^2 will be high too in that case), then the dependent variable is mostly “explained” by the action of the independent variable. That would mean that most of Y can be explained/predicted by what happens to X. If X is an index, it means that Y pretty much follows the index.

R^2 has a direct interpretation: the percentage of variance explained. If you looked at the distribution of Y all by itself and used the average value of (deviation from the mean)^2, you’d get a certain value, called the unconditional variance. If you instead use X to predict Y, you should get a bunch of predictions that are on average closer to the observed values than just predicting Y with MEAN(Y). Now add up the squared deviations between your predictions and the observed values and take the average. The average amount that your predictions are off (squared) should be smaller than the unconditional variance. If your model is good, it should be a LOT smaller. When you say that R^2 is the variance explained, you are basically talking about how much smaller the remaining variance is after using the model. Basically it’s:

R^2 = 1 - [(variance of residuals) / (unconditional variance)]

Maybe that helps?
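A short sketch of that variance-ratio calculation on simulated data, using NumPy’s polyfit for the one-variable regression; it just checks that 1 minus the residual/unconditional variance ratio matches the squared correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)  # Y partly explained by X, plus noise

# Fit the usual one-variable linear regression.
slope, intercept = np.polyfit(x, y, 1)
pred = slope * x + intercept

unconditional_var = np.mean((y - y.mean()) ** 2)  # predicting Y with MEAN(Y)
residual_var = np.mean((y - pred) ** 2)           # predicting Y with the model

r2 = 1 - residual_var / unconditional_var
print(f"R^2 from variances: {r2:.4f}")
print(f"correlation^2:      {np.corrcoef(x, y)[0, 1] ** 2:.4f}")  # matches
```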

My experience with the R-sqrd statistic is in performing regressions. R-sqrd basically defines the “goodness of fit” of a model. Say, in a crazy example, you are trying to model S&P returns and you randomly select variables like the daily temperature in New York City, the number of daily auctions on eBay, and so on and so forth, and run a regression. You’d come up with some R-sqrd value that will probably be very low (if not zero in this case). Interpretation: these crazy variables explain essentially none of the variation in S&P returns. The problem with the R-sqrd statistic is that as you add more variables to the regression, the value of R-sqrd keeps increasing, giving the modeler a false sense of the model’s explanatory power. The more important statistic is the “adjusted” R-sqrd, which increases as useful variables are added to a regression but starts decreasing at the point where adding another variable provides no benefit. The modeler stops at that point and declares the regression to be at its optimal level. Clear as mud??
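Here’s a rough illustration of that, assuming the textbook adjusted R-sqrd formula 1 - (1 - R^2)(n - 1)/(n - k - 1) and purely random “junk” regressors standing in for the temperature/eBay variables: plain R-sqrd creeps upward as variables are added, while adjusted R-sqrd does not.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
y = rng.normal(size=n)  # stand-in for S&P returns

def r2_and_adjusted(X, y):
    # OLS with an intercept column.
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    r2 = 1 - resid.var() / y.var()
    k = X.shape[1]
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - k - 1)
    return r2, adj

# Add one junk regressor at a time: plain R^2 only creeps upward,
# while adjusted R^2 penalizes the useless additions.
junk = rng.normal(size=(n, 10))
for k in range(1, 11):
    r2, adj = r2_and_adjusted(junk[:, :k], y)
    print(f"{k:2d} variables: R^2 = {r2:.4f}, adjusted R^2 = {adj:.4f}")
```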

Thanks everyone. These examples help out a lot.

Yeah, as kutta said, one of the key points to note is that in multivariate regression plain R-sqrd is not much use, because it is guaranteed not to decrease when you add a new variable. In the multivariable case you need to look at adjusted R-sqrd.

If that’s the case, then wouldn’t the correlation (R) be enough? It seems everything R^2 tells you is available from R. Caveat: I’m yet to review quant.