P-Value versus Type I Error

Hello,

I am having a hard time understanding the difference between the p-value and type I error. As per Kaplan, the type I error is the error of rejecting the null hypothesis when it is in fact true.

The p-value is the probability of obtaining a test statistic that would lead to a rejection of the null, assuming the null is in fact true.

So are they saying that the p-value is the quantified mathematical description of the type I error?

Also, Kaplan has an example about the p-value that I do not understand:

  • 95% confidence, 5% significance, two-tailed test, Z = ±1.96

  • The test statistic is 2.3. The p-value is 1.07% × 2 = 2.14%. I am not sure how this figure is calculated.

  • Aren’t the rejection points of a two-tailed test the critical values, which would leave 2.5% in both the upper and lower tails? So how come we reject only 2.14% of the time when we’re supposed to reject 5% of the time?

In a nutshell:

  • Alpha (α) is the probability of a Type I error; you choose α before you start looking at the data.
  • The p-value is the probability of getting a value at least as extreme as your sample value assuming that the null hypothesis is true.

Therefore, the p-value is not the probability of a Type I error; α is. If p < α, you reject the null hypothesis; if p > α, you do not reject the null hypothesis.
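If it helps to see that rule as code, here is a minimal Python sketch; the function name is just illustrative, and the 0.0214 value is the one from the Kaplan example above:

```python
def decide(p_value: float, alpha: float) -> str:
    """Hypothesis-test decision rule: reject H0 when p < alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

print(decide(p_value=0.0214, alpha=0.05))  # reject H0 (0.0214 < 0.05)
print(decide(p_value=0.0214, alpha=0.01))  # fail to reject H0 (0.0214 > 0.01)
```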

Howdy! Kaplan nailed it thus far!

Kaplan is wrong here. Their statement inadvertently ties the p-value to an alpha level, which is incorrect (these are independent of one another, and originally [long ago], they weren’t intended to be used together). As S2000 said, the p-value is the probability of obtaining a test statistic at least as extreme as the observed one, assuming that the null hypothesis is true (…as well as all our other assumptions).

If they are saying that, they’re mistaken.

They’re using a table (or program) to generate a one-tailed p-value. Assuming they’re doing a two-tailed test with a symmetrical distribution, they multiply the one-tailed p-value by two so it represents the total area in the two tails (i.e., to get a two-tailed p-value).
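Assuming the 2.3 is a standard normal z-statistic (which the ±1.96 critical values suggest), the arithmetic they’re doing looks like this in Python:

```python
from scipy.stats import norm

z = 2.3
one_tailed_p = norm.sf(z)        # P(Z > 2.3) = 1 - Phi(2.3), about 0.0107
two_tailed_p = 2 * one_tailed_p  # area in both tails beyond +/-2.3, about 0.0214

print(round(one_tailed_p, 4), round(two_tailed_p, 4))  # 0.0107 0.0214
```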

The rejection points (when using the rejection region method) are the critical values for the given significance level.

Because we aren’t. The p-value of 0.0214 is not telling us “the percent of time we reject”. As mentioned earlier, Kaplan’s explanation is erroneous. The 0.0214 is telling you that if the null hypothesis is true, there was a 0.0214 probability (a 2.14% chance) of observing a test statistic (or sample mean, for example) at least as extreme as the observed one. Long story short, the smaller the p-value, the more our observed data disagree with the null hypothesis. Once we’ve reached our threshold for disagreement (say, below 0.05), we reject Ho.
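One way to see what that 0.0214 is (and isn’t) saying: simulate a world in which the null is true and count how often a test statistic at least as extreme as 2.3 shows up. A rough sketch, assuming a standard normal test statistic under the null:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Many test statistics drawn under a true null (standard normal).
z_under_null = rng.standard_normal(1_000_000)

# Fraction at least as extreme as the observed 2.3, in either tail.
frac_extreme = np.mean(np.abs(z_under_null) >= 2.3)
print(round(frac_extreme, 4))  # close to the exact two-tailed p-value of 0.0214
```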

Hope this helps!

Both of you have made these concepts easier to understand.

With that said, I truly want to grasp statistical theories and thus have a few clarifying questions if you guys don’t mind:

What is the benefit of a p-value? If it is the probability of obtaining a test statistic that is at least as extreme as the observed statistic (assuming the null hypothesis is true, along with our other assumptions), how is that useful?

Going back to p-value = 0.0214, it suggests that there is a 2.14% chance of getting a test statistic that is as or more extreme than the observed statistic. My question is: So what? Also, does this example basically say that there is a 2.14% chance of obtaining 2.3, a statistic that is more extreme than the observed statistic of 1.96?

I know that p-value < 0.05 is the rule of thumb, but why?

As a reference point for other candidates reading this thread, Kaplan (incorrectly) states on page 274 of Book 1 (June 2017 edition) that the "p-value is the probability of obtaining a test statistic that would lead to a rejection of the null hypothesis, assuming the null hypothesis is true."

The p-value, more or less, summarizes the strength of disagreement with the null that our data provide for us to interpret. It’s going to help us make an inference about the population. We start by assuming Ho is true. We collect data, and we analyze it in some manner. When we look at that summary, the p-value lets us know how reasonable our data are, assuming the null is true (in less precise terms). So, we look at the p-value and ask, "Based on the data, can we say the null hypothesis is unreasonable?" When the p-value is small, it tells us one of two things (you need to decide which): 1) if the null hypothesis is true, then we’re seeing something that’s unlikely (i.e., we observed a fluke), or 2) these data are really incompatible with the null hypothesis (because we had a low chance to see this or more extreme results), therefore, I suspect the null hypothesis does not hold (assuming our other assumptions are valid). Again, this is somewhat loose, but the general idea is there (are we seeing something very unlikely, or are we seeing that our assumption of Ho might be false?). Your job is to pick an alpha level to decide your comfort in making a Type I error (rejecting Ho when Ho is actually true).

My explanation above covered this somewhat. The point is that it helps us quantify how much our observations disagree with the null hypothesis. If, under a true null, you have a small chance of seeing something at least as contradictory to the null hypothesis, you can begin to consider that the finding isn’t a fluke and that the null is incorrect (assuming the rest of your process is good and appropriate).

Not 1.96 (the selected critical value), but the calculated test statistic (2.3, in your case). Remember, for a two-tailed p-value, “at least as extreme” means values at or below −2.3 as well as at or above +2.3 (more extreme in both tails).

There’s some historical context to it and some psychological context. Realistically, you would weigh the risks and benefits of choosing an alpha level; many of these depend on context and the rest of your research. If you’re running many tests, you might opt for a much lower threshold to keep the overall Type I error rate low (think of a compounding effect for every test you run). If you’re doing something without any prior theory, you may pick a very small alpha (in the hopes of keeping the Type I error risk very low), or you might choose a larger alpha (say, 0.15) to reduce the risk of a Type II error (failing to reject a false null). It’s highly situational, and the 0.05 rule of thumb is basically arbitrary, since there isn’t a one-size-fits-all alpha level.
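The “compounding effect” from running many tests can be made concrete: with m independent tests each run at level α, the chance of at least one Type I error across the whole family is 1 − (1 − α)^m. A quick, purely illustrative calculation:

```python
alpha = 0.05
for m in (1, 5, 10, 20):
    familywise_error = 1 - (1 - alpha) ** m  # P(at least one Type I error)
    print(m, round(familywise_error, 3))
# 1 -> 0.05, 5 -> 0.226, 10 -> 0.401, 20 -> 0.642: the risk compounds quickly
```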

  1. If the null hypothesis is true, then we’re seeing something that’s unlikely

Here, we keep assuming the null is true and attribute the unlikely/extreme outcome to chance.

or 2) these data are really incompatible with the null hypothesis (because we had a low chance to see this or more extreme results), therefore, I suspect the null hypothesis does not hold (assuming our other assumptions are valid).

So, if p < alpha, it suggests that the data are incompatible with the null hypothesis, and thus we reject the null hypothesis? Rationale: we reject the null because there is only a small chance (2.14%) of seeing this value or more extreme values (like 2.3 in the example).

Are these correct interpretations?

For the CFA exam, how would we decide which of the two interpretations to use? Would we focus on the second situation or the first when the p-value is small(er) at a given significance level? It’s challenging to know when to reject or not reject the null, given that the p-value is computed under the assumption that the null is correct. This is especially concerning in 1): why should we assume the null is true? Doesn’t that defeat the purpose of hypothesis testing?

Also, what happens when the p-value is large?

In general, when p < α, CFA Institute wants you to say that you will reject Hₒ. However, they also want you to acknowledge that you will incorrectly reject Hₒ part of the time; the probability of doing so (i.e., making a Type I error) is α.

When p > α, CFA Institute wants you to say that you will not reject Hₒ. However, they also want you to acknowledge that you will incorrectly fail to reject Hₒ part of the time; the probability of doing so (i.e., making a Type II error) is (1 − the power of the test); i.e., the power of the test is the probability of rejecting Hₒ when it’s false. They expect you to know the name of that probability, but not how to compute it.
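The exam only asks for the names, but if it helps to see where power comes from, here is a minimal sketch for a two-sided z-test; the true effect size (expressed in standard-error units) is a made-up number:

```python
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)  # about 1.96 for a two-sided 5% test

# Hypothetical true effect in standard-error units:
# delta = (mu_true - mu_0) / (sigma / sqrt(n))
delta = 2.8

# Power = P(reject H0 | H0 false) for a two-sided z-test
power = norm.sf(z_crit - delta) + norm.cdf(-z_crit - delta)
beta = 1 - power  # P(Type II error)
print(round(power, 2), round(beta, 2))  # roughly 0.8 and 0.2
```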

You’re correct in your interpretation, but I could have been clearer. You have to choose between these two possible explanations when you examine the results of your analysis and see a small p-value (alpha helps us do this). Really, the idea is that if the p-value doesn’t exceed alpha, then we conclude number two: concluding that the null hypothesis should be rejected is more reasonable than concluding that we saw something rare (more or less). Think of a carnival game. Sometimes they seem much harder to win than you would expect from a fair game. So, rather than just assuming you (and everyone else) have really bad luck, you conclude the game is probably rigged to an extent.

In general, number 2 is what you’re going with when the p-value is less than alpha (this is essentially the modern practice and what you’re doing on the CFA exam). In reality, you can still weigh the two and investigate further if needed (this was the original intent of the method: to question, to rerun experiments, etc.).

That’s why we set our threshold for making a Type I error by picking alpha ahead of time and using it in our decision making (this will help us decide to reject or fail to reject Ho).

Again, examples 1 and 2 were a way to help conceptualize our thought process when we see a small p-value (is this a fluke, or is this sufficient evidence to suggest the null is wrong?). If you had to make a bet, you’d probably say it’s not a fluke, but that it’s evidence against Ho being true. We just assume the null is true as a starting point, sort of as a straw man. We say, “I have these data, and I suppose that Ho is true. Now, if Ho is true, the p-value suggests these data don’t fit well with Ho. Therefore, I reject Ho.” This is letting the data guide us to an inference (we have observed the data, but have no way of knowing the truth). It seems reasonable to say that our hypothesis should be rejected when the data are against it.

When the p-value is large, it’s a similar idea. The data don’t disagree strongly with Ho (note, this is not evidence for Ho, nor is it proof of Ho). We say that these data are not extreme enough to justify rejecting Ho. We haven’t proved Ho and haven’t found evidence for Ho, but we do know that we haven’t found evidence disagreeing with Ho to an appreciable extent. Long story short, we’re not seeing anything that would be considered odd if our null is true, so we fail to reject the null.
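For contrast, here is what the large-p-value case might look like numerically; this is a minimal sketch with a made-up test statistic, again assumed to be standard normal:

```python
from scipy.stats import norm

z = 0.8                             # a not-very-extreme test statistic
two_tailed_p = 2 * norm.sf(abs(z))  # about 0.42

alpha = 0.05
print(round(two_tailed_p, 2))       # 0.42 > 0.05, so we fail to reject H0
```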

I hope this helps. Sorry if I was a little unclear before with statements 1 and 2. They were intended to illustrate an underlying thought process.

Once again, thank you S2000 and tickersu for helping me out!

The carnival example you provided may help me understand what’s going on…

So if the p-value < alpha, we conclude that the null should be rejected. The alternative says that the game is somehow rigged, and therefore it is NOT statistically significant that the observations have bad luck.

Rationale: this is the case because our chance of getting the “bad luck” observation (or even worse luck) is very low. Therefore, we reject the notion of bad luck, and the alternative is that there is some sort of rigging going on.

Does my argument make sense? If it does, I think I finally grasp the concept!!

Glad it was useful!

Correct.

The null is that the game is fair; the alternative is that the game is not fair (rigged). If our experience at the carnival (the data) causes us to reject Ho (the game is fair), then we conclude Ha (the game is not fair). This would be a statistically significant finding. We would be saying, “I don’t have this much bad luck (reject Ho). The game must not be fair (conclude Ha)!”

If our experience with the game was good, and we have no suspicion of a rigged game, then we would fail to reject Ho (fair game). We would be unable to conclude the game is not fair/rigged (Ha). This would be a nonsignificant finding. Keep in mind, though, that this does not act as evidence of Ho (fair game) nor does it prove Ho (fair game). It just means our data don’t disagree enough with Ho to reject it.
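To push the carnival analogy all the way into an actual test, here is a rough sketch of how “is this game rigged?” could be checked with a binomial test. The claimed win probability and the play counts below are entirely made up:

```python
from scipy.stats import binomtest

# H0: the game is fair -- the win probability really is the claimed 0.25.
# Ha: the game is rigged against the player -- win probability < 0.25.
claimed_win_prob = 0.25
wins, plays = 6, 60  # hypothetical results from an afternoon of playing

result = binomtest(wins, plays, p=claimed_win_prob, alternative="less")

alpha = 0.05
if result.pvalue < alpha:
    print(f"p = {result.pvalue:.4f}: reject H0 -- hard to square with a fair game")
else:
    print(f"p = {result.pvalue:.4f}: fail to reject H0 -- no strong evidence of rigging")
```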

With the tweaks I made earlier in this post, you’ve got it. Your “Rationale” captures the essence of this idea.