# Data Analysis

So I have two columns in Excel (about 30,000 rows). Say the first column is the Source “s” and the second is the Copy “c”. My boss suspects that the copier cheated from the source (had access to non-public info before it became public) and asked me to prove or disprove it. Any idea how to compare the two columns to eventually reject (or fail to reject) the null that the Copier had access to the data in the Source and actually cheated?

Err. This is a tough question. What other methods could the guy have used? That is, was there a way to forecast this data with some degree of accuracy?

ohai Wrote:

> Err. This is a tough question. What other methods could the guy have used? That is, was there a way to forecast this data with some degree of accuracy?

No forecasting involved. We are talking about valuations, and there is a critical piece of information that the Copier could have accessed (from another division in his/her bank) that would have helped him/her with the valuation.

We need more information than this. What kind of data is in the spreadsheet? How do you expect the rows to look if the suspected cheater did not cheat and how would they look if they did cheat?

The reason we suspect cheating is that the data is broken down into two categories. His model works perfectly for one segment but does not do a good job on the second. Given his explanation of the model and how it works, there should be no difference in performance between the two segments (based on the math involved). We also have data from other banks, and their models do not show this gap in performance between the two categories.
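One way to put a number on the "works for one segment but not the other" observation is a permutation test on the valuation errors. The sketch below is only an illustration: `errors_a` and `errors_b` are hypothetical lists of absolute errors (for example |copy - source|) for the two segments, and the function name is made up. A small p-value would only say the gap between segments is unlikely to be chance; it would not by itself show access to non-public information.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def segment_gap_permutation_test(errors_a, errors_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in mean error.

    errors_a, errors_b: hypothetical per-row absolute errors for each segment.
    Returns (observed_gap, p_value).
    """
    rng = random.Random(seed)
    observed = abs(mean(errors_a) - mean(errors_b))

    pooled = list(errors_a) + list(errors_b)
    n_a = len(errors_a)

    extreme = 0
    for _ in range(n_permutations):
        # Under the null, segment labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        gap = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if gap >= observed:
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value
```

With 30,000 rows this pure-Python loop may be slow, so a smaller `n_permutations` is reasonable for a first look. Running the same test on the other banks' models would give a baseline for how large a gap is normal.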

JoeyDVivre Wrote:

> We need more information than this. What kind of data is in the spreadsheet? How do you expect the rows to look if the suspected cheater did not cheat and how would they look if they did cheat?

I was trying to get some insight from statistical methods for detecting cheating in multiple-choice tests, but these models make assumptions that I can’t make (like there being one correct solution to each question, or a large population of test takers from which to draw a probability distribution, which I don’t have). If I test for independence between the source data and the copier data (I don’t know how yet), would that be an acceptable analysis in this case? (My assumption is that if the cheater did not cheat, his results will be statistically independent from the Source.)
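A minimal sketch of what such an independence test could look like, assuming both columns are numeric and paired row by row. It shuffles the copy column to break any pairing and checks how often the shuffled Pearson correlation is as strong as the observed one. The names `source`, `copy_values`, and both functions are hypothetical. One caveat on the assumption itself: two honest valuations of the same assets would likely be correlated too, so rejecting independence here would not by itself indicate copying.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (neither constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def independence_permutation_test(source, copy_values, n_permutations=2000, seed=0):
    """Permutation test of 'no linear association' between paired columns.

    Returns (abs_observed_r, p_value).
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(source, copy_values))

    shuffled = list(copy_values)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffling one column destroys the row pairing, simulating independence.
        rng.shuffle(shuffled)
        if abs(pearson_r(source, shuffled)) >= observed:
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value
```

A more informative variant might apply the same test to the residuals left after removing what public information explains, but that depends on details the thread does not give.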

This is somewhat of a wild suggestion, but could you apply Benford’s law in some form? In certain circumstances it can be used to detect data sets that have been tampered with. It’s really unclear whether this is applicable at all unless we have more info, though.
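For reference, a first-digit Benford check could be sketched roughly as follows. Benford’s law gives the expected share of leading digit d as log10(1 + 1/d), and the code compares observed counts with that using a chi-square statistic. The function names are hypothetical, and the 5% cutoff of 15.507 (8 degrees of freedom) is a conventional choice rather than anything from this thread. Benford’s law is usually a reasonable reference only for data spanning several orders of magnitude, so whether it fits these valuations is an open question.

```python
import math
from collections import Counter

# Expected share of each leading digit under Benford's law.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Chi-square critical value for 8 degrees of freedom at the 5% level.
CHI2_CRIT_DF8_5PCT = 15.507


def first_digit(value):
    """Leading nonzero digit of a nonzero number, e.g. 0.0456 -> 4."""
    # Scientific notation puts the leading digit first: '4.56...e-02'.
    return int(f"{abs(value):.15e}"[0])


def benford_chi_square(values):
    """Return (chi2_statistic, deviates_at_5pct) for the first-digit test."""
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)

    chi2 = 0.0
    for digit, share in BENFORD.items():
        expected = n * share
        chi2 += (counts.get(digit, 0) - expected) ** 2 / expected

    return chi2, chi2 > CHI2_CRIT_DF8_5PCT
```

With about 30,000 rows, even small and harmless departures from Benford’s law can exceed the cutoff, so a flag from this check is at most a prompt to look closer.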