# Why is correlation so high for the whole data set but so low for smaller parts?

Can someone please help me with this? I'm running correlations on two variables, with monthly data for the last 7-8 years. If I run a simple correlation between the two variables over the whole period, I get a pretty low number (-0.067). But if I plot the charts, the correlation seems much higher; there's a very distinct pattern. And when I run correlations for each year separately, they are a lot higher: year 1 = -0.716, year 2 = -0.8964, year 3 = -0.9034, year 4 = -0.5643, year 5 = -0.8728, year 6 = -0.8348, year 7 = -0.8829, year 8 = -0.8649. Also, if I break the data into two halves, each half has a fairly high correlation: half 1 = -0.59, half 2 = -0.798. I just don't understand why the correlations are so high for the sub-periods but so low for my whole data set. I'm trying to maximize the correlation for the whole data set and can't seem to get anywhere. I've tried removing seasonality, since there is a seasonal effect, but that doesn't seem to increase the correlation all that much. Any ideas, or any explanation for the differences in correlations? Thanks a lot guys!

Can't say more without analyzing your specific dataset, but remember that standard correlation metrics such as the Pearson correlation coefficient measure the degree of *linear* relationship between two variables. There may be a strong non-linear relationship that is evident from plotting the data but still gives you a low correlation coefficient. Remember that uncorrelated does not mean independent at all. Also, it's a lot more likely for a relationship to be approximately linear over a short time period (i.e. high correlation coefficient) but very non-linear if you zoom out and look at it over a longer time period. Also read this: http://en.wikipedia.org/wiki/Anscombe's_quartet
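A quick sketch of that point, with made-up numbers: below, y is a deterministic function of x, so the two are completely dependent, yet the Pearson correlation is essentially zero because the relationship is symmetric rather than linear.

```python
import numpy as np

# Hypothetical data: y depends perfectly on x (y = x^2), but the
# relationship is symmetric around x = 0, not linear, so Pearson r ~ 0.
x = np.linspace(-3, 3, 61)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(r)  # essentially 0: uncorrelated, but clearly not independent
```

A scatterplot of x against y would make the dependence obvious immediately, which is exactly why plotting before (or alongside) computing correlations is good practice.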

Looks like something wacky happened in year 4. What are you regressing? Are you regressing something like index levels or are you regressing returns? When you say you plot the charts, are you plotting a scatterplot of one variable vs. the other? Or are you plotting a time series and noting that the shapes look similar, rising and falling at the same time?

Thanks a lot guys! Mobius, thanks for the link. I've tried linear, quadratic, cubic, and logarithmic fits, and the numbers are still very low when I look at the whole series. From the charts I think the relationship looks linear, but there is a lot of seasonality, so the lines move up and down a lot. bchadwick, I'm regressing monthly mortgage retention rates against the amount of originations each month. When more mortgages are originated, more people are buying and selling homes, which means more loans are being liquidated, so the retention rate should be lower. I plotted line charts comparing the two variables over time, and I can see that when one falls, the other rises. What I don't understand is this: if year 4 seems to be ruining my correlations, how come year 4's correlation on its own is still pretty high, especially compared to the number I'm getting for the whole series? When I plot the charts, though, I do notice a lot of up-and-down movement in the variables.

That's why supporting a view / allocation process based on historical correlation has its shortcomings. Don't treat year 4 as an anomaly but as an event that can certainly happen again.

But the thing is, even if I remove year 4 from my data set, the correlation on the whole dataset is still very low (-0.04).

It can (and often does) happen that a dataset has a higher/lower correlation in a sub-period than in the whole set. But generally what you see is that a given sub-period's correlation is significantly different from what you'd expect - not the other way around. So I can't see how the case you're describing is possible. That being said, and not wanting to jump the gun without seeing your data, I'd say you're miscalculating something somewhere. If you want, put the data up on a Google Docs spreadsheet and we can all crowd-source this problem.

Sounds like you may have some outliers or dirty data. There may be a data filter you're using that works appropriately with the subperiods but isn't working when you throw the whole data set at it. In other words, there's other data in your full sample that shouldn't be there (or, alternately, data in your subperiods that shouldn't be there).

Thanks guys, I've put the data up on a Google Docs spreadsheet; if any of you have ideas or find anything I did wrong, I'd appreciate it! Also, this is the first time I'm using Google Docs, so I hope I did it right. https://spreadsheets.google.com/ccc?key=0AswxLMSW9_G1dHY3S1ZqckFRUHhQci1WenNVUmRDVFE&hl=en

Well, there's definitely seasonality in there that you need to account for. After doing that, try running your analysis on the *change* in the retention rate rather than the retention rate itself. I only looked at this without the seasonal adjustment and didn't find evidence of a unit root (so that's not the motivation for differencing; it just happened that the correlations were more stable this way). It makes sense, though: origination can be thought of as the change in mortgages outstanding (excluding mortgages paid off), so you should want to compare that to the change in retention rather than the level of retention. Without accounting for seasonality, the correlations were consistent at around -20% over the full sample, the first half, and the second half.
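The two adjustments suggested above can be sketched with pandas. The series below are made up (a seasonal swing plus a common shock standing in for the real spreadsheet); the mechanics are what matter.

```python
import numpy as np
import pandas as pd

# Made-up monthly series standing in for the real data: opposite seasonal
# swings, plus a common shock that links the two series negatively.
rng = np.random.default_rng(1)
idx = pd.date_range("2003-01-01", periods=96, freq="MS")
season = np.sin(2 * np.pi * np.arange(96) / 12)
shock = rng.normal(0, 2, 96)
df = pd.DataFrame({
    "originations": 100 + 10 * season + shock,
    "retention": 80 - 8 * season - 0.8 * shock + rng.normal(0, 0.5, 96),
}, index=idx)

# 1) Seasonal adjustment: subtract each calendar month's average
deseason = df - df.groupby(df.index.month).transform("mean")

# 2) Correlate month-over-month changes rather than levels
changes = deseason.diff().dropna()
print(changes["originations"].corr(changes["retention"]))  # strongly negative
```

With the real data, you'd load the two columns from the spreadsheet in place of the synthetic ones and run the same two steps.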

jimjohn, you are calculating your annual correlations using monthly data, i.e. 12 data points? The 95% confidence interval around those yearly correlations is so wide that they are good for nothing, in my opinion.
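To put numbers on how wide that interval is, here's a sketch using the standard Fisher z-transform approximation, taking the year-4 figure of -0.5643 as an example:

```python
import math

# Approximate 95% CI for a Pearson correlation via the Fisher z-transform.
def corr_ci(r, n, z_crit=1.96):
    z = math.atanh(r)                # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)      # approximate standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

lo, hi = corr_ci(-0.5643, 12)        # year 4, estimated from 12 months
print(lo, hi)  # roughly (-0.86, 0.01): the interval even includes zero
```

With only 12 observations, even a point estimate of -0.56 is consistent with anything from a very strong negative correlation to no correlation at all.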

Thanks guys! jmh, that sounds like a good idea - the correlations do increase quite a bit with the differences! Mobius, that's right, I am using only 12 data points to calculate the yearly correlations, so I guess they can't be very accurate. But I always thought it would be harder to find a high correlation with such a small sample size than with a larger one?

You can certainly find a very strong correlation with only a few data points; the confidence, though, will be very low. Your data has a very strong negative correlation within each year; however, if you plot the series, you see that long term there is an opposite trend (they move in the same direction). This is why you get low correlation over the entire period.
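A synthetic sketch of this effect (hypothetical series, not the OP's data): both series drift upward over 96 months, but their month-to-month wiggles move in opposite directions.

```python
import numpy as np

# Both series trend up together, but their monthly wiggles are opposed.
rng = np.random.default_rng(42)
t = np.arange(96)                      # 8 years of months
wiggle = rng.normal(0, 1, 96)
a = 0.10 * t + 2.0 * wiggle            # up-trend, positive wiggle
b = 0.08 * t - 1.5 * wiggle            # up-trend, negative wiggle

# Within a single year the wiggle dominates, so each yearly correlation
# is strongly negative...
yearly = [np.corrcoef(a[i:i + 12], b[i:i + 12])[0, 1]
          for i in range(0, 96, 12)]
print([round(r, 2) for r in yearly])

# ...but over the full sample the shared up-trend dominates, pulling the
# correlation toward (mildly) positive territory.
print(round(np.corrcoef(a, b)[0, 1], 2))
```

This reproduces the pattern in the thread: strongly negative year-by-year correlations alongside a weak full-sample correlation.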

One of the datasets can also go through a regime shift. For example, if you calculate the correlation between the datasets for period I (1-12) and period II (13-24) separately, each correlation is -1 (I used identical returns for dataset1 in periods I and II and shifted the returns of dataset2 up by 20 from period I to period II). For the whole period, though, the correlation is -0.0995. There could be lots of explanations, but the best approach would be to plot the data, as Mobius and others have suggested, and then figure out what's happening.

| Period | dataset1 | dataset2 |
|-------:|---------:|---------:|
| 1 | 1 | -1 |
| 2 | -1 | 1 |
| 3 | 1 | -1 |
| 4 | -1 | 1 |
| 5 | 1 | -1 |
| 6 | -1 | 1 |
| 7 | 1 | -1 |
| 8 | -1 | 1 |
| 9 | 1 | -1 |
| 10 | -1 | 1 |
| 11 | 1 | -1 |
| 12 | -1 | 1 |
| 13 | 1 | 19 |
| 14 | -1 | 21 |
| 15 | 1 | 19 |
| 16 | -1 | 21 |
| 17 | 1 | 19 |
| 18 | -1 | 21 |
| 19 | 1 | 19 |
| 20 | -1 | 21 |
| 21 | 1 | 19 |
| 22 | -1 | 21 |
| 23 | 1 | 19 |
| 24 | -1 | 21 |
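That toy example is easy to reproduce (a sketch assuming NumPy):

```python
import numpy as np

# Period I: dataset2 is exactly -dataset1, so correlation is -1. Shifting
# dataset2 up by 20 in period II keeps each sub-period at -1 but drags the
# full-period correlation toward zero.
d1 = np.tile([1, -1], 12)                    # periods 1-24
d2_I = np.tile([-1, 1], 6)                   # periods 1-12
d2 = np.concatenate([d2_I, d2_I + 20])       # periods 13-24: 19, 21, ...

print(np.corrcoef(d1[:12], d2[:12])[0, 1])   # -1 (up to fp rounding)
print(np.corrcoef(d1[12:], d2[12:])[0, 1])   # -1 (up to fp rounding)
print(round(np.corrcoef(d1, d2)[0, 1], 4))   # -0.0995
```

The level shift inflates dataset2's overall variance without changing the within-period co-movement, which is exactly what crushes the full-period correlation.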

jimjohn Wrote:

> Mobius, that's right, I am using only 12 data points to calculate correlations, so I guess the correlations can't be very accurate. But I always thought it would be harder to find a high correlation given such a small sample size compared to a larger sample size?

If I have 2 data points, I can fit a line through them - the correlation will be perfect (-1 or 1). Add a third data point, and it will drop. It's hard to find high correlation with a larger sample size, and when you do, it is more likely to be statistically significant given the large number of data points.
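The two-point case is easy to check with a toy sketch:

```python
import numpy as np

# Two points always lie exactly on some line, so |r| = 1 ...
x = np.array([1.0, 2.0])
y = np.array([5.0, 3.0])
print(np.corrcoef(x, y)[0, 1])        # -1: a line through two points

# ... while a third point off that line pulls |r| below 1.
x3 = np.append(x, 3.0)
y3 = np.append(y, 6.0)
print(np.corrcoef(x3, y3)[0, 1])      # now strictly between -1 and 1
```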

Thanks a lot guys, it makes sense now! I never even noticed treynor's point that although the short-term correlations are high, in the long run they're both going in the same direction.

Mobius Striptease Wrote:

> If I have 2 data points, I can fit a line through them - the correlation will be perfect (-1 or 1). Add a third data point, and it will drop. It's hard to find high correlation with a larger sample size, and when you do, it is more likely to be statistically significant given the large number of data points.

That's true, but I'd say that the fact that out of 8 years, 7 had correlations in the -0.83 to -0.90 range suggests there is something relatively stable, even if the confidence intervals are a mile wide. There's presumably some hypothesis test one could develop (ANOVA or some such) to establish this, but I'm too tired to think of it. Probably the better suggestion is the earlier one to regress changes in levels on each other; there may be persistent effects that need to be filtered out to support the OP's hypothesis. I'd also run a running correlation, say 24 months, and see how it wanders over time. You may see a regime shift in the correlation, or just one time period where things go wacky (like year 4 or whatnot).

How did you guys manage to look at the data and test for unit roots and such? I was unable to copy anything, or to create cells to do the tests I wanted.
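That running-correlation idea can be sketched with pandas (made-up random-walk series below; the real retention and origination columns would go in their place):

```python
import numpy as np
import pandas as pd

# Stand-in monthly series; substitute the real retention/origination data.
rng = np.random.default_rng(7)
idx = pd.date_range("2003-01-01", periods=96, freq="MS")
df = pd.DataFrame({"retention": rng.normal(size=96).cumsum(),
                   "originations": rng.normal(size=96).cumsum()}, index=idx)

# 24-month running correlation; the first 23 months are NaN by construction.
rolling_corr = df["retention"].rolling(window=24).corr(df["originations"])
print(rolling_corr.dropna().round(2).head())
```

Plotting `rolling_corr` over time shows whether the relationship is stable, wanders slowly, or jumps - the signature of a regime shift like the year-4 episode.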

Just opened it up in google and copied it to excel. There is very clear seasonality in the data.