Interpreting the Correlation Coefficient

Hey AF,

I know correlation measures the strength of the linear relationship between two variables. For positive correlation, it says that a correlation greater than 0 indicates that when one variable increases (or decreases), the other tends to increase (or decrease) as well.

But if you have a correlation of say, 0.70 – other than saying there is a “moderately strong linear relationship between the two” – how do I specifically interpret the 0.70?

Does it mean that 70% of the time the two variables move in the same direction? Or 70% of the time they move in the same direction in a straight line?

Thanks in advance for any response.

Its interpretation is quite similar to that of Beta. So if there is a correl of 0.7 between the market and a stock, if the market moves up 10 pts, the stock should move up 0.7 times 10 pts i.e. 7 pts.

With all due respect, this answer is completely wrong: not only in its analogy of correlation to beta, but also in its interpretation of beta.

First things first: a correlation coefficient of +0.7 does not tell you anything – _anything!_ – about the relative change in y compared to x, apart from it being positive. The standard deviation of y can be equal to that of x, or twice as big, or half as big, or 10,000 times as big, or 0.00001 times as big, or anything else (other than zero). All a correlation coefficient of +0.7 tells you is that:

  • 49% (= 0.7²) of the change in y is explained by the change in x (see the quick numerical sketch below)
  • y and x tend to be on the same side of their respective means (y above μ_y and x above μ_x, or y below μ_y and x below μ_x) more often than they are on opposite sides of their respective means (y above μ_y and x below μ_x, or y below μ_y and x above μ_x)
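A minimal numerical sketch of those two points (assuming Python with numpy; the series are simulated purely for illustration, not taken from any real data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.standard_normal(n)                 # unit-variance x
noise = rng.standard_normal(n)

# y = rho*x + sqrt(1 - rho^2)*noise has correlation ~rho with x;
# rescaling y changes its standard deviation but not the correlation.
rho = 0.7
base = rho * x + np.sqrt(1 - rho**2) * noise
y_small = 0.0001 * base                    # tiny standard deviation
y_big = 10_000 * base                      # huge standard deviation

for name, y in [("y_small", y_small), ("y_big", y_big)]:
    r = np.corrcoef(x, y)[0, 1]
    print(name, "corr =", round(r, 2), " r^2 =", round(r**2, 2), " std =", y.std())
# Both correlations are ~0.70 (so r^2 ~ 0.49), even though the two standard
# deviations differ by a factor of 10^8.
```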

Second: a beta of +0.7 does not mean that if the market moves up 10 points, the stock should move up, on average, 7 points. Beta is not calculated using prices; it is calculated using _returns_. A beta of +0.7 means that if the market’s _return_ changes +1.0%, the stock’s _return_ should change, on average, +0.7%.
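A small sketch of that distinction (assuming numpy; the price series below are made up purely for illustration): beta is estimated from returns, as the covariance of stock and market returns divided by the variance of market returns.

```python
import numpy as np

# Hypothetical daily closing prices (illustrative numbers only).
market_prices = np.array([100.0, 101.0, 99.5, 102.0, 103.5, 102.8])
stock_prices  = np.array([ 50.0,  50.4,  49.9,  51.2,  51.9,  51.6])

# Beta is defined on returns, not on price levels.
market_ret = np.diff(market_prices) / market_prices[:-1]
stock_ret  = np.diff(stock_prices) / stock_prices[:-1]

beta = np.cov(stock_ret, market_ret, ddof=1)[0, 1] / np.var(market_ret, ddof=1)
print(round(beta, 2))
# A beta of +0.7 would mean: for a +1.0% market return, the stock's return is
# expected to be about +0.7%, regardless of the index's point level.
```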

Please be much more careful in your answers here.

Does it mean that 70% of the time the two variables move in the same direction?

No.

Or 70% of the time they move in the same direction in a straight line?

No.

All it means is that:

  • 49% (= 0.7²) of the change in y is explained by the change in x
  • y and x tend to be on the same side of their respective means (y above μ_y and x above μ_x, or y below μ_y and x below μ_x) more often than they are on opposite sides of their respective means (y above μ_y and x below μ_x, or y below μ_y and x above μ_x)

You’re welcome.

Agreed! Idk what I was thinking. Thanks for the hit on the head.

Thanks again S2000, you explained it very well. For r-squared though when you say 49% of the change in y is explained by the change in x — is it appropriate to say, when x is above (or below) its mean, 49% of WHY y is above (or below) its mean is explained by x? With the other 51% coming from some other variable(s)…

Or actually what I meant: is it correct to say when the return of x is above its mean, 49% of the return of y (which should also be above its mean) is explained by x? Meaning, if the return of y is 3% above its mean, 49% of that excess return is explained by x, with the other 51% coming from some other factor…

Am I in the ballpark?

It doesn’t get that fine.

When I say that 49% of the change in y is explained by the change in x, I mean that in a macro sense: 49% of the variance of y is explained by the variance of x. I believe that you cannot narrow that to a specific instance (y being 3% above its mean, for example).
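To illustrate that macro sense, a hedged sketch with simulated data (assuming numpy): the r² from a simple linear fit describes the share of variance explained in aggregate, while individual observations can still sit far from the fitted line.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000)
y = 0.7 * x + np.sqrt(1 - 0.49) * rng.standard_normal(1_000)   # corr(x, y) ~ 0.7

# Simple least-squares fit of y on x.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

r = np.corrcoef(x, y)[0, 1]
r_squared = 1 - residuals.var() / y.var()
print("r^2 from the fit:", round(r_squared, 2), "  r squared:", round(r**2, 2))  # both ~0.49

# In aggregate, ~49% of the variance of y is explained; any single observation
# can still be explained very well or very poorly.
print("largest absolute residual:", round(np.abs(residuals).max(), 2))
```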

This is a good question for tickersu; I’d love it if he would wade into the discussion.

In the strictest sense I think you should write “49% (= 0.7²) of the change in y has been explained by the change in x within the period used to calculate the correlation coefficient.”

This claim (that y and x are on the same side of their respective means more often than they are on opposite sides) is wrong. From the definition of the correlation coefficient:

ρ_xy = cov(x, y) / (s_x · s_y), where cov(x, y) is the covariance between x and y, and s_x and s_y are the standard deviations of x and y

also

cov(x, y) = (1/n) · Σ_{i=1}^{n} (x_i − E(x)) · (y_i − E(y))

with the expected value of x denoted as E(x) (here E(x) = mean value) and

s_x = (cov(x, x))^0.5

Your argument (same side of the expected value) gives you a positive sign for the respective term in the covariance, but besides the sign, the magnitude of each term also needs to be considered. A high covariance (and hence a high correlation) can also be reached by being on the same side of the expected value less often than not, provided that when the values are on the same side they are far from their respective means, and when they are not on the same side they are close to their means. As an example, consider the data set (x_i, y_i): {(1, 2), (1, 2), (1, 2), (10, 26), (1, 2), (1, 2), (-10, -14), (1, 2)}

For this data set you find:

E(x) = 0.75

E(y) = 3

Here, in 6 out of 8 cases, x is above its mean while y is below its mean. So in the majority of cases they are on opposite sides of their respective means. Analyzing the data set further:

s(x) = 5.02

s(y) = 10.15

cov(x, y) = 49.25

ρ_xy ≈ 0.97

So x and y are highly correlated despite x and y being on opposite sides of their respective means more often than not.
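A quick check of the example’s numbers (assuming numpy, and using the population 1/n convention that matches the covariance formula above):

```python
import numpy as np

x = np.array([1, 1, 1, 10, 1, 1, -10, 1], dtype=float)
y = np.array([2, 2, 2, 26, 2, 2, -14, 2], dtype=float)

print("E(x) =", x.mean(), "  E(y) =", y.mean())                # 0.75 and 3.0
print("s(x) =", round(x.std(ddof=0), 2),                       # ~5.02
      "  s(y) =", round(y.std(ddof=0), 2))                     # ~10.15
cov_xy = ((x - x.mean()) * (y - y.mean())).mean()              # population (1/n) covariance
print("cov(x, y) =", cov_xy)                                   # 49.25
print("rho_xy =", round(np.corrcoef(x, y)[0, 1], 2))           # ~0.97
opposite = np.sign(x - x.mean()) != np.sign(y - y.mean())
print("opposite sides of their means in", int(opposite.sum()), "of 8 cases")  # 6 of 8
```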

I think your interpretation is about as good as you can do in the absence of other data. The correlation coefficient alone doesn’t really tell you that much.

Thanks Nukular, your example was helpful. I have a few questions:

  1. You said, “I think your interpretation is about as good as you can do in the absence of other data.” Other than adding more variables, what other data do you look at for further explanation?

  2. As in your example, if x and y have a high positive correlation (0.97) but, even so, when one is above its mean the other is below (and vice versa), how do you make any inferences about the results? If I were to put this in a model, it seems like even if you’re right in your prediction for x, the model’s prediction for y can still be way off, even with a high r-squared… So I guess, what’s the point of statistics if you can’t rely on anything?

  3. Is the best way to infer any probability about y being correct to: a) use the data to create the regression line, b) hope that you predict the actual value for x, and c) take your y-hat value plus/minus two standard errors to infer a 95% probability of it being correct (over the long term, assuming the sample repeats itself)?

It just seems like with statistics, if you’re wrong you’re wrong, and if you’re right, you’re probably going to be wrong as well…

  1. I believe that scatter plots are extremely helpful to see if your correlation coefficients are meaningful. It is also beneficial to include your predicted fit in the plot. Here is an example from another website. If you work with more than 3 variables at once scatter plots become less ideal, but even then I still like them to get a first “view” on the data or subsets of data.

2 & 3) If you find a data set like the one I have given above you should probably not use a linear fit of the data for predictive purposes. (Look at an ANOVA table for the data set above and a simple linear fit.)

Good point.

That’s why I wrote “tend to be” rather than “are”.

Thanks guys for the help; makes it much more clear.

As far as I know, it’s not granular in that manner. On a magnified view, you may completely “explain” a given observation, but on another, you may do quite poorly. I agree with your points.

As pointed out, visualizing the data is a powerful tool. The CFA curriculum is really garbage for stats. One of the FIRST things you should do with an appropriately cleaned and coded data set is data exploration with descriptive/graphical methods to get a feel for it. A good example of why: a perfect quadratic relationship (say, y = x² over a range of x symmetric about zero) plotted on a graph. The Pearson correlation would be zero, but initial data exploration with a graph would show there’s a strong nonlinear relationship between the two variables.
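A tiny sketch of that quadratic example (assuming numpy):

```python
import numpy as np

x = np.linspace(-5, 5, 101)   # symmetric about zero
y = x ** 2                    # a perfect, but nonlinear, relationship

print(round(np.corrcoef(x, y)[0, 1], 6))   # ~0.0: Pearson r only measures linear association
# A scatter plot of (x, y) would immediately show the parabola that the
# correlation coefficient completely misses.
```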

Correlation isn’t used much for inferences other than an inference about the strength and direction of the linear relationship in the population. You’re generally not predicting X in the model (sans cases of imputation). Predicting individual Y values may be way off, but that’s where overall accuracy may be worth examining. Poor predictions can help guide better models, though. An in-sample adjusted R-squared isn’t really anything to write home about. There are other kinds of cross-validation that attempt to separate model building from “prediction” without getting a new data set, but there are limits to each; see K-fold, jackknife, and various other resampling methods. The point is that there’s an art and a lot of skill needed to do it well, contrary to what all of the six-sigma, data-scientist “statisticians” say about themselves (not real statisticians).
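As a hedged illustration of the cross-validation idea (assuming scikit-learn; the data below are simulated stand-ins): K-fold CV fits the model on part of the data and scores it on the held-out part, giving a rough out-of-sample view rather than an in-sample R².

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Simulated data standing in for whatever is actually being modeled.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))
y = 0.7 * X[:, 0] + np.sqrt(1 - 0.49) * rng.standard_normal(200)

# 5-fold cross-validation: fit on 4 folds, score R^2 on the held-out fold.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print("out-of-fold R^2 per fold:", np.round(scores, 2))
print("mean out-of-fold R^2:", round(scores.mean(), 2))
```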

No. Confidence intervals and prediction intervals (special kinds of CIs) are not able to tell you “95% chance of being in the interval/correct”, contrary to common misunderstanding. For example, let’s suppose you create a 95% CI whose upper bound happens to fall EXACTLY on the actual (and unknown) value of Y (say, 3) we are trying to predict. If the sampling distribution is perfectly normal around that actual Y value of 3, then AT MOST, 50% of future Y observations would fall within the calculated 95% CI of (lower bound, 3). The phrase “95% confident” refers to the long run accuracy of our methodology for generating Frequentist confidence (and prediction) intervals. That is, the true parameter value is fixed at, say, mu, and 95% of all possible 95% CI/PI intervals will contain mu if we repeatedly and infinitely draw random samples and calculate these intervals for mu. If you want to make statements such as “there is a 95% probability Y/Mu is in the interval [x,y]” you need to employ a Bayesian credible (posterior probability) interval, which isn’t covered in the CFAI curriculum at my last glance years ago.
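A quick simulation of that long-run, frequentist meaning of “95% confident” (assuming numpy; a z-interval with known sigma, for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 3.0, 1.0, 25        # true (normally unknown) mean; known sigma
z = 1.96                           # ~95% two-sided standard-normal quantile
n_trials = 100_000

covered = 0
for _ in range(n_trials):
    sample = rng.normal(mu, sigma, n)
    half_width = z * sigma / np.sqrt(n)
    covered += (sample.mean() - half_width <= mu <= sample.mean() + half_width)

print(covered / n_trials)   # ~0.95: about 95% of intervals built this way contain mu
# The 95% describes the procedure over repeated sampling, not a 95% probability
# that mu sits inside any one particular interval you have already computed.
```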

You don’t have to be exactly right, you just have to be good enough, overall. The curriculum being as poor as it is, it doesn’t surprise me that you’re feeling this sentiment!

Hey tickersu, thanks for taking the time to respond. I definitely agree there is an art to statistics — you can’t simply rely on a checklist. Can you recommend any good statistics books?

Probably.

(Sorry. :wink:)

It depends on your background (mathematics/hard science with strong math vs social science with less mathematics) and what you’re looking to learn (theory, application, both).

I’d be happy to reply in the thread but this will basically clog up the thread for a tangentially related topic. Feel free to message me and we can discuss some books!