Data mining bias

yasin · April 24, 2020, 10:34am

Hi, I read in the book that data mining is the process of looking for patterns in raw data until one that “works” is found, and that repeated use of the same database can lead to data mining bias. As an example, they mention the empirical research showing that value stocks outperform growth stocks, and say this is biased information.

I am confused and would appreciate it if you guys could confirm or correct the conclusions/confusions I have about this:

Data mining is looking for patterns in raw data, which isn't a bad thing, but sometimes analysts use the same database over and over again to cherry-pick and emphasize patterns that suit the outcome they want, and this leads to bias.
Regarding the empirical research example: is there something to be gained by showing that value stocks outperform growth stocks, or is this an accidental mistake?

not_a_CFA · April 24, 2020, 6:08pm

The authors seem to grossly oversimplify how models are chosen for a given data set. Good modeling strategies have nothing to do with finding a pattern that “just works” (that practice is closely related to p-hacking). Instead, researchers try to find a model that accurately reflects the data, with as little variance and bias as possible.

What is bias? When you measure something, your measurements will typically contain some error. With a good model, the average of all the errors should tend towards zero: some measurements will be a little above the theoretical values and others a little below, so that the errors roughly cancel out.

One way to model the data is through a regression equation. It turns out that if you have too few variables (called predictors or features) in the equation, the model is simpler and has lower variance, but at the cost of higher bias (the error will on average be significantly different from zero). If you choose a model with many variables, the bias tends towards zero, but at the cost of higher variance (the model follows the noise in the data too closely to be useful). Researchers try to find an appropriate balance between the two; this is known as the bias-variance trade-off.

To be sure, this is a necessarily simplified explanation. There are statistical techniques to measure bias and to determine precisely what it means for the bias to be “significantly different” from zero. For more information, I strongly recommend checking a textbook dedicated to statistical regression.
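To make the trade-off concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: the true relationship `y = 2x`, the noise level, the sample size, and the function name `simulate`. It compares a model with no predictors (it just predicts the sample mean) against a simple one-variable regression, and reports the bias and variance of each model's prediction at a single point `x0`.

```python
import random
import statistics


def simulate(n_trials=2000, n_obs=20, x0=1.0, noise_sd=1.0, seed=0):
    """Estimate bias and variance of two models' predictions at x0.

    Assumed (hypothetical) data-generating process: y = 2*x + Gaussian noise.
    """
    rng = random.Random(seed)
    true_value = 2.0 * x0
    preds_const, preds_linear = [], []

    for _ in range(n_trials):
        # Draw a fresh training sample each trial.
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n_obs)]
        ys = [2.0 * x + rng.gauss(0.0, noise_sd) for x in xs]
        x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)

        # Model A: no predictors, always predicts the sample mean of y.
        preds_const.append(y_bar)

        # Model B: ordinary least squares with one predictor.
        sxx = sum((x - x_bar) ** 2 for x in xs)
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        slope = sxy / sxx
        preds_linear.append(y_bar + slope * (x0 - x_bar))

    def summarize(preds):
        bias = statistics.fmean(preds) - true_value
        variance = statistics.pvariance(preds)
        return bias, variance

    return {"constant": summarize(preds_const), "linear": summarize(preds_linear)}


if __name__ == "__main__":
    for name, (bias, variance) in simulate().items():
        print(f"{name:>8}: bias = {bias:+.3f}, variance = {variance:.3f}")
```

Under these assumptions, the mean-only model should show a large bias but a small variance, while the regression should show a bias near zero and a somewhat larger variance. This only illustrates the "too few variables" side of the trade-off; an overfitted model with many irrelevant predictors would push the variance up further.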

While it is true that one should be careful not to draw hasty conclusions from a small set of data, the finding that value stocks have historically outperformed growth stocks comes from decades of historical data. Not months. Not years. Decades. In fact, that is one of the reasons you learn about CAPM and Fama-French-Carhart in the first place: both provide for a theoretical efficient portfolio, which is nearly impossible to beat. This makes sense: imagine you are in a room with 1 million other highly intelligent investors, and everyone has access to the exact same information. To get a return above the market, you would need to somehow outguess all the other investors, who are all, in turn, trying to outguess you. Sometimes you may very well be lucky and get a positive alpha, but that's just it: luck. As for value historically outperforming growth: one reason, according to Burton Malkiel (A Random Walk Down Wall Street), is investor overconfidence. Investors are overconfident that they can predict how much the stocks will grow, so they pay more than they should for growth stocks. This drives up the price of growth stocks and makes them less profitable.
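The "luck" point, and the data mining bias the original question asks about, can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of any real market: every "strategy" below has a true alpha of exactly zero, and the monthly volatility, the number of strategies, and the function name `best_of_many` are all made up for the example.

```python
import random
import statistics


def best_of_many(n_strategies=1000, n_months=120, monthly_sd=0.04, seed=1):
    """Simulate strategies with zero true alpha and look at the luckiest one.

    Returns the largest t-statistic found and the number of strategies
    whose t-statistic exceeds 2 (a common rough significance threshold).
    """
    rng = random.Random(seed)
    best_t = float("-inf")
    n_significant = 0

    for _ in range(n_strategies):
        # Monthly excess returns are pure noise: the true mean is zero.
        alphas = [rng.gauss(0.0, monthly_sd) for _ in range(n_months)]
        mean = statistics.fmean(alphas)
        std_err = statistics.stdev(alphas) / n_months ** 0.5
        t_stat = mean / std_err

        if t_stat > 2.0:
            n_significant += 1
        best_t = max(best_t, t_stat)

    return best_t, n_significant


if __name__ == "__main__":
    best_t, n_significant = best_of_many()
    print(f"best t-statistic among skill-free strategies: {best_t:.2f}")
    print(f"strategies with t > 2 purely by chance: {n_significant}")
```

Even though no strategy has any skill, testing enough of them on the same data will usually turn up a few that look statistically significant. Reporting only the best one, without accounting for how many were tried, is the kind of bias the book is warning about.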