I want to begin with an apology for the length of this post. It is long, but the payoff is huge (especially since it was hinted at that there may be a question on a future AP exam that follows this idea closely).
So, I will kick it off with a great way of writing Null and Alternative Hypothesis statements that is very understandable and fits with the ideas written about in the first post in this series.
There are only two conclusions you can write when you are doing inference testing, and these two mirror the two types of errors:
1. Because the p-value of (__) is less than alpha = (__), we reject Ho. There is convincing evidence that (Ha – write out).
2. Because the p-value of (__) is greater than alpha = (__), we fail to reject Ho. There is not convincing evidence that (Ha – write out).
Type I Error: Finding convincing evidence that Ha is true, when it really isn’t.
Type II Error: Not finding convincing evidence that Ha is true, when it really is.
This is the structure Josh Tabor uses in class for Null and Alternative conclusions and for describing type I and type II errors. It is aligned with the two statements he uses at the beginning of the year, as well as the evidence / convincing evidence questions he asks.
But it doesn’t really explain power, yet.
To do that, let’s look at a story that can be used in class.
A particular high school batter, Johnny, hit .250 last season and was not very happy with that average. Over the summer Johnny attended several batting and baseball camps and feels he improved. Not just improved a little, but improved dramatically! Johnny now feels he can hit .333 based on his performance at camps, and he is ready to take a larger role on the team. The coach, however, still believes Johnny is a .250 hitter. The coach also has a good grounding in stats, so he sets up a batting experiment for Johnny. He is going to take a sample of 20 pitches, record the proportion of hits, and make the decision based on that sample.
At the end of the practice, Johnny had 9 hits out of 20, or .450! Johnny says this means he is way better than .333, but the coach disagrees and says .333 is reasonable.
So what is going on here? At the beginning of the year, we would have two options:
A. Johnny achieved that number of hits through random chance
B. Johnny is actually better than the coach thought
Do we have evidence for one or the other? Do we have convincing evidence?
At this point of the year we would have a Null and Alternative:
Ho: Johnny is still a .250 batting average batter
HA: Johnny is better than a .250 batting average batter (he claims .333)
We could use the p-value and the alpha level we choose to make a decision, and we are finished. Right?
Not so fast. What about the power of the test created by the coach?
Well, let’s start with what we know about power. What four things can increase the power of a test?
1. Make alpha bigger, which means not requiring as much convincing evidence.
2. Increase the sample size. More information makes the decision easier. The curves become narrower, which makes beta smaller.
3. Make the SD smaller through better experimental design (think of the Rania problem earlier).
4. Move the means of the two curves further apart. The distance between them is called the Effect Size.
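To see all four levers move power in the same direction, here is a minimal sketch using the normal approximation for a one-sided test of a mean. The function name `z_test_power` and every number plugged into it are made up for illustration; they are not from Johnny's story or from class.

```python
from statistics import NormalDist

def z_test_power(mu0, mu_a, sigma, n, alpha):
    """Approximate power of a one-sided (upper-tail) z-test for a mean."""
    std_normal = NormalDist()
    # Cutoff for rejecting Ho, measured in standard errors above mu0.
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Distance between the two curves, measured in standard errors.
    shift = (mu_a - mu0) / (sigma / n ** 0.5)
    # Probability the test statistic lands past the cutoff when Ha is true.
    return 1 - std_normal.cdf(z_crit - shift)

# Hypothetical baseline values, chosen only to show the direction of each change.
baseline = z_test_power(mu0=0.0, mu_a=0.5, sigma=1.0, n=20, alpha=0.05)
bigger_alpha = z_test_power(mu0=0.0, mu_a=0.5, sigma=1.0, n=20, alpha=0.10)
bigger_n = z_test_power(mu0=0.0, mu_a=0.5, sigma=1.0, n=80, alpha=0.05)
smaller_sd = z_test_power(mu0=0.0, mu_a=0.5, sigma=0.5, n=20, alpha=0.05)
bigger_effect = z_test_power(mu0=0.0, mu_a=1.0, sigma=1.0, n=20, alpha=0.05)

for label, value in [("baseline", baseline), ("bigger alpha", bigger_alpha),
                     ("bigger n", bigger_n), ("smaller SD", smaller_sd),
                     ("bigger effect size", bigger_effect)]:
    print(f"{label:20s} power = {value:.3f}")
```

Each of the four changed versions should print a larger power than the baseline, which is the whole point of the list.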
I will post some screen captures here to help explain.
The top distribution, with the little red end, is the Null Hypothesis distribution. The red dots (all three of them) are the simulated samples of size 20 that had at least 9 hits. We ran a hundred samples, so the dotplot should give a reasonable picture of the distribution. The simulated p-value for Johnny getting 9 hits out of 20 is 3/100 = .03, so we have an answer to our question. The coach has convincing evidence that Johnny improved from a .250 batting average.
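If you do not have the applet handy, the same simulation can be sketched in a few lines of Python. This is my own rough version, not the tool that made the screen captures: the helper names and the seed are arbitrary, and a run of 100 samples will bounce around from seed to seed. The exact binomial tail is included for comparison; it works out to about .041, so a simulated .03 is right in the neighborhood.

```python
import random
from math import comb

def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def simulate_hits(n, p, trials, rng):
    """Simulate `trials` practices of n swings each; return the hit counts."""
    return [sum(rng.random() < p for _ in range(n)) for _ in range(trials)]

rng = random.Random(1)  # arbitrary seed so the run is repeatable

# Null Hypothesis world: Johnny is still a .250 hitter.
null_counts = simulate_hits(n=20, p=0.25, trials=100, rng=rng)

# Simulated p-value: share of null samples with at least 9 hits.
sim_p_value = sum(count >= 9 for count in null_counts) / len(null_counts)

# Exact p-value from the binomial distribution.
exact_p_value = binom_upper_tail(9, 20, 0.25)

print("simulated p-value:", sim_p_value)
print("exact p-value:    ", round(exact_p_value, 3))
```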
But here is the payoff.
The bottom distribution, with the green end, is the Alternative Hypothesis distribution. The proportion of green samples, 21/100, is the POWER of the test. If Johnny really is a .333 hitter, the coach has a 21% probability of not making a type II error here! The black part of the distribution is beta, 79/100 or 79%.
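The power can also be worked out exactly instead of counting dots. The sketch below assumes alpha = .05 (the post never pins down the coach's alpha, so that is my assumption) and that the Alternative is Johnny hitting exactly .333. It finds the smallest number of hits that would reject Ho, then asks how often a .333 hitter reaches it. Under those assumptions the cutoff is 9 hits and the exact power is about .19, close to the simulated 21%.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_value(n, p0, alpha):
    """Smallest hit count c with P(X >= c | p0) <= alpha."""
    for c in range(n + 2):  # c = n + 1 has tail probability 0, so this always returns
        if binom_upper_tail(c, n, p0) <= alpha:
            return c

n = 20
p_null = 0.25      # coach's belief (Ho)
p_alt = 1 / 3      # Johnny's claim (Ha)
alpha = 0.05       # assumed significance level, not stated in the post

cutoff = critical_value(n, p_null, alpha)
power = binom_upper_tail(cutoff, n, p_alt)   # P(reject Ho | Johnny hits .333)
beta = 1 - power                             # P(type II error)

print("reject Ho at", cutoff, "or more hits")
print("power:", round(power, 3), " beta:", round(beta, 3))
```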
Notice the mean of the top (Null) distribution is approximately 5, while the mean of the bottom (Alternative) distribution is approximately 6. The difference between them is the Effect Size. How can we increase the power of this test? We can increase the sample size from 20 to 80. That will cut the spread of the sample proportion in half. That would give us this picture:
That increased the power to 37%. Of course, we would have to know how many hits Johnny had on the test with 80 at bats. I would think he would be one tired baseball player by that point.
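The "spread cut in half" idea comes from the standard deviation of a sample proportion, sqrt(p(1 - p)/n): making n four times bigger divides it by 2. Here is a rough sketch that tracks exact power as the sample size grows, again assuming alpha = .05 and an Alternative of exactly .333 (both my assumptions). The sample sizes other than 20 and 80 are just extra points for comparison. A 100-sample simulation like the one in the screen captures can land some distance from these exact figures, and the figures themselves change with the alpha you pick.

```python
from math import comb, sqrt

def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def exact_power(n, p_null, p_alt, alpha):
    """Power of the upper-tail binomial test of Ho: p = p_null against p = p_alt."""
    # Smallest hit count whose tail probability under Ho is at most alpha.
    cutoff = next(c for c in range(n + 2)
                  if binom_upper_tail(c, n, p_null) <= alpha)
    return binom_upper_tail(cutoff, n, p_alt)

p_null, p_alt, alpha = 0.25, 1 / 3, 0.05   # alpha is an assumed value

power_by_n = {}
for n in (20, 40, 80, 160):
    sd_of_proportion = sqrt(p_null * (1 - p_null) / n)
    power_by_n[n] = exact_power(n, p_null, p_alt, alpha)
    print(f"n = {n:3d}  SD of p-hat = {sd_of_proportion:.3f}  "
          f"power = {power_by_n[n]:.3f}")
```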
Josh hinted strongly that a “calculate power” problem is coming on the AP exam, and this is similar to the format that would be used. It isn’t that hard when you are looking at it like this. In fact, power makes a lot more sense.