I want to begin with an apology for the length of this post. It is long, but the payoff is huge (especially since it was hinted at that there may be a question on a future AP exam that follows this idea closely).

So, I will kick it off with a great way of writing conclusions about the Null and Alternative Hypotheses, one that is very understandable and fits with the ideas written about in the first post in this series.

There are only 2 conclusions you can write when you are doing inference testing, and ONLY these two, which mirror the two types of errors:

1. Because the p-value of (__) is less than alpha = (__) we reject the Ho. There is convincing evidence that (Ha – write out).

2. Because the p-value of (__) is greater than alpha = (__) we fail to reject the Ho. There is not convincing evidence that (Ha – write out).

Type I Error: Finding convincing evidence that Ha is true, when it really isn’t.

Type II Error: Not finding convincing evidence that Ha is true, when it really is.
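The two templates above can be turned into a tiny helper so the wording never drifts. This is only a sketch: the function name `conclusion` and the example Ha text are made up for illustration, and because the templates do not say what to write when the p-value exactly equals alpha, the sketch refuses that case rather than guessing.

```python
def conclusion(p_value: float, alpha: float, ha_text: str) -> str:
    """Return one of the two allowed conclusion sentences."""
    if p_value == alpha:
        # The two templates do not cover a tie, so this sketch does not either.
        raise ValueError("templates do not cover p-value == alpha")
    if p_value < alpha:
        return (f"Because the p-value of {p_value} is less than alpha = {alpha} "
                f"we reject the Ho. There is convincing evidence that {ha_text}.")
    return (f"Because the p-value of {p_value} is greater than alpha = {alpha} "
            f"we fail to reject the Ho. There is not convincing evidence that {ha_text}.")


# Hypothetical example: the Ha text is a placeholder, not from a real problem.
print(conclusion(0.03, 0.05, "the true mean is greater than 10"))
print(conclusion(0.20, 0.05, "the true mean is greater than 10"))
```

A Type I error can only happen after the first sentence is written, and a Type II error only after the second, which is why the two conclusions mirror the two errors.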

This is the format / structure of Null and Alternative answers and formulation for type I and type II errors that Josh Tabor uses in class. It is aligned with the two statements he uses at the beginning of the year, as well as the evidence / convincing evidence questions he asks.

But it doesn’t really explain power, yet.

To do that, let’s look at a story that can be used in class.

A particular high school batter, Johnny, hit .250 last season and was not very happy with that average. Over the summer Johnny attended several batting and baseball camps and feels he improved. Not just improved a little, but improved dramatically! Johnny now feels he can hit .333 based on his performance at camps, and he is ready to take a larger role on the team. However, the coach still believes Johnny is a .250 hitter. The coach has a good grounding in stats, though, so he sets up a good batting experiment for Johnny. He is going to take a sample of 20 pitches, find the proportion of hits, and make the decision based on that sample.

At the end of the practice, Johnny hit 9/20 or .450! Johnny says this means he is way better than .333, but the coach disagrees and says the .333 is reasonable.

So what is going on here? At the beginning of the year, we would have two options:

A. Johnny achieved that number of hits through random chance

B. Johnny is actually better than the coach thought

Do we have evidence for one or the other? Do we have convincing evidence?

At this point of the year we would have a Null and Alternative:

Ho: Johnny is not a .333 batting average batter

HA: Johnny is a .333 batting average batter

We could use the p-value and the alpha level we choose to make a decision, and we are finished. Right?

Not so fast. What about the power of the test created by the coach?

Well, let’s start with what we know about power. What 4 things can change, or have an effect on the power of a test?

1. Make alpha bigger. Not requiring as much convincing evidence.

2. Increase sample size. More information makes making a decision easier. The curves become narrower, which makes beta smaller.

3. Make SD smaller by better experimental design, think the Rania problem earlier.

4. Have the means of the two curves further apart, which is also called the Effect Size.

There is a handy applet from Rossman/Chance that will help us out tremendously in this question. The one we specifically want is on Power (and oddly enough has a baseball set-up).

I will post some screen captures here to help explain.

The top distribution with the little red end is the Null Hypothesis. The red dots (all three of them) are the samples of size 20 in which our simulation had at least 9 hits out of 20. We did a hundred samples, so we should be confident that the curve is reasonably distributed. The p-value associated with Johnny having 9 hits out of 20 is .03, so we have an answer to our question. The coach should believe that Johnny improved from a .250 batting average.

But here is the payoff.

The bottom distribution with the green end is the Alternative hypothesis. The proportion of green samples, 21/100, is the POWER of the test. The coach has a 21% probability of not making a type II error here! The black part of the distribution is the beta, 79/100 or 79%.
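The applet gets these numbers by simulating 100 samples, so they bounce around a bit. As a rough check, here is a sketch that computes the same two tail areas exactly from the binomial distribution. It assumes each of the 20 pitches is an independent try with a constant chance of a hit, and that "9 or more hits" is the rejection region; the function name `upper_tail` is just for illustration.

```python
from math import comb


def upper_tail(n: int, p: float, k: int) -> float:
    """P(X >= k) when X is Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


N_PITCHES = 20    # sample size from the story
CUTOFF_HITS = 9   # Johnny's result; assumed start of the rejection region
P_NULL = 0.250    # what the coach believes
P_ALT = 0.333     # what Johnny claims

p_value = upper_tail(N_PITCHES, P_NULL, CUTOFF_HITS)  # red area, top curve
power = upper_tail(N_PITCHES, P_ALT, CUTOFF_HITS)     # green area, bottom curve
beta = 1 - power                                      # black area, bottom curve

print(f"p-value = {p_value:.3f}, power = {power:.3f}, beta = {beta:.3f}")
```

The exact values land close to the simulated .03 and 21%, which is what we would expect from only 100 simulated samples.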

Notice the mean of the top, Null, distribution is approximately 5, while the mean of the bottom distribution, Alternative, is approximately 6. The difference between them is the Effect Size. How can we increase the power of this test? We can increase the sample size from 20 to 80. That will reduce the spread by half. That would give us this picture:

That increased the power to 37%. Of course, we would have to know how many hits Johnny had on the test with 80 at bats. I would think he would be one tired baseball player by that point.
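To see the sample size effect without the applet, here is a sketch that picks the rejection region for each sample size and then computes the exact binomial power. The alpha of 0.05 is an assumption (the post does not say which alpha the applet run used), so the output will not necessarily match the simulated percentages above; the point is the direction of the change.

```python
from math import comb


def upper_tail(n: int, p: float, k: int) -> float:
    """P(X >= k) when X is Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_hits(n: int, p_null: float, alpha: float) -> int:
    """Smallest number of hits k with P(X >= k | p_null) <= alpha."""
    for k in range(n + 2):  # k = n + 1 always qualifies (empty tail = 0)
        if upper_tail(n, p_null, k) <= alpha:
            return k
    raise RuntimeError("unreachable")


def power(n: int, p_null: float = 0.250, p_alt: float = 0.333,
          alpha: float = 0.05) -> float:
    """Probability of rejecting Ho when the alternative is really true."""
    return upper_tail(n, p_alt, critical_hits(n, p_null, alpha))


for n in (20, 40, 80, 160):
    print(n, critical_hits(n, 0.250, 0.05), round(power(n), 3))
```

With 20 pitches the rejection region starts at 9 hits, the same cutoff as in the story, and the power climbs as the number of pitches goes up.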

Josh hinted strongly that a “calculate power” problem is coming on the AP exam, and this is similar to the format that would be used. It isn’t that hard when you are looking at it like this. In fact, power makes a lot more sense.

I am going to put some caveats up front on this post. I am going to use stats vocab without explanation and absolutely reach above the AP Stats curriculum to make the understanding of the curriculum we teach more apparent.

With that said, let’s start with the Rania and Peter problem from 2012. Not the whole problem, just parts C and D. I have put the important info on the problem below.

Notice the adjustment to the mean given in the formula? That is a really important thing to realize. When you stratify, you can use the known properties of the population to adjust the mean and standard deviation to get more accurate results than if you didn’t adjust.

For instance, look at the problem as it is presented. Did either learner, Peter or Rania, sample more than the other? Did either one do more work? Did either one need more copies or more time interviewing people? No.

So as it stands, both people did exactly the same amount of work to get their results. Did they get equivalent results? Well, their means are in different places, but it seems okay so far, that could be due to sampling variation.

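But if we look at their standard deviations, something interesting happens. Rania’s SD is found through the formula: $\sqrt{(0.6)^2 \frac{s_F^2}{60} + (0.4)^2 \frac{s_M^2}{40}} = 0.1979$, where $s_F$ and $s_M$ are the sample standard deviations for the females and the males. Ok, things just changed radically.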

Rania’s standard deviation is much, much smaller than Peter’s and it is ALL due to the fact she stratified the sample. This is what stratification gets you. Stratifying your sample lowers the SD of the stratified sample.
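The 0.1979 can be reproduced with the usual standard deviation formula for a weighted combination of two independent sample means. Big caveat: the two sample standard deviations below (1.80 for females, 2.22 for males) are not given anywhere in this post. They are filled in from memory of the 2012 problem and should be checked against the problem itself; the only evidence offered here is that they give back 0.1979.

```python
from math import sqrt

# Weights and sample sizes from the post: 60% female, 40% male, 100 students.
W_F, W_M = 0.6, 0.4
N_F, N_M = 60, 40

# ASSUMED values, not stated in this post: sample SDs for each stratum.
S_F, S_M = 1.80, 2.22


def stratified_sd(w_f, s_f, n_f, w_m, s_m, n_m):
    """SD of w_f * xbar_f + w_m * xbar_m for two independent samples."""
    return sqrt(w_f**2 * s_f**2 / n_f + w_m**2 * s_m**2 / n_m)


rania_sd = stratified_sd(W_F, S_F, N_F, W_M, S_M, N_M)
print(round(rania_sd, 4))
```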

Let’s put that another way. You have two different samples; one is not stratified and one is stratified. Which one has more POWER? (Yes, 1-beta, that power.)

If we use what we know from the AP curriculum, we know that increasing the sample size also increases power because it reduces the standard deviation. Wait, what’s that? Reducing the standard deviation of the sample increases power, so therefore stratifying a sample increases power also!

Oh snap. Things just got real.

Why is stratification part of the core of statistics? Because the practice of stratification increases the power of a sample, (and here is the big part) WITHOUT INCREASING THE SAMPLE SIZE. We don’t have to do more work to get the increase in power; we just have to think more.

But let’s extend this a little bit.

Rania chose the 60 females and 40 males in the problem because that matches the known proportion of the population. Is that the ideal sample to take to minimize the Standard Deviation?

In essence we want to graph the equation, and find the minimum of the function. At that point we will have the best value for how many men and women Rania should have asked to maximize power and minimize the SD.

A little help from our Desmos friends and we end up with:

And there we go. 55 women and 45 men will minimize the standard deviation and simultaneously maximize the power of the test to not make Type II errors. All for no more work than Peter did with his sample of 100 random people.
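The same minimum can be found without graphing by trying every possible split of the 100 students. As a sketch, this keeps the 0.6 and 0.4 population weights fixed and only changes how many females and males are sampled. The stratum standard deviations (1.80 and 2.22) are assumed values recalled from the 2012 problem, not numbers given in this post.

```python
from math import sqrt

TOTAL = 100            # same amount of work as Peter's sample
W_F, W_M = 0.6, 0.4    # known population proportions from the post
S_F, S_M = 1.80, 2.22  # ASSUMED stratum SDs, not stated in this post


def stratified_sd(n_f: int) -> float:
    """SD of the weighted estimate when n_f females and TOTAL - n_f males are sampled."""
    n_m = TOTAL - n_f
    return sqrt(W_F**2 * S_F**2 / n_f + W_M**2 * S_M**2 / n_m)


# Try every split that leaves at least one person in each stratum.
best_n_f = min(range(1, TOTAL), key=stratified_sd)

print(best_n_f, TOTAL - best_n_f, round(stratified_sd(best_n_f), 4))
print(60, 40, round(stratified_sd(60), 4))
```

Under these assumed values the search lands on the same 55 and 45 split, and the improvement over 60 and 40 is small compared with the improvement stratifying gave in the first place.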

Stratification is good, has real results and is not just a word to teach as vocab. It has real value in understanding what we are doing in statistics.

——————————–

We need to have this caveat though: this only works when the characteristics of the population we are stratifying on and sampling from have different standard deviations. If the characteristics of the strata are the same, then we end up with what Peter did. We have to think hard about the independence condition part of the conditions check before we will KNOW it works.

On the topic of condition checks, Josh Tabor made a claim that bore some proving. He said that the conditions checks are not equal. In fact, some are essential and mandatory, and others are simply … well, optional.

For instance, the n < 10% of the population condition is completely optional and has not been required in the AP FRQ rubrics. Check it out. Totally true. The FRQ rubrics require the randomness condition and the large counts (or normality) condition, but not the n < 10% condition.

Mind blown. The textbook makes them all equal and all mandatory and doesn’t really explain why we do them, just that they are all required.

So here is the skinny on the conditions checks for z and t interval, why we check them, and the implications if the conditions are violated.

So let’s look at the formula for the z and t Confidence Interval and break them down.
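For reference, these are the two intervals in their usual textbook form: the one-sample z interval for a proportion and the one-sample t interval for a mean. The symbols are the standard ones ($\hat{p}$ and $\bar{x}$ for the point estimates, $s$ for the sample standard deviation, $n$ for the sample size).

```latex
\hat{p} \pm z^{*} \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}}
\qquad\qquad
\bar{x} \pm t^{*} \, \frac{s}{\sqrt{n}}
```

Each has the same three pieces: a point estimate, a critical value, and a standard error. The three conditions line up with those three pieces.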

1. The first part of the formula is the mean or point estimator for the interval. What is special about the point estimator? It is supposed to be an unbiased estimate of the mean.

Wait, what’s that? An unbiased estimator? How do we get the sample mean to be an unbiased estimate of the population mean? Oh, yeah, we RANDOMLY SAMPLE from the population and calculate the mean, and random sampling is what makes the mean of the sample an unbiased estimator.

What happens if we do not have an unbiased estimator then? What if phat or xbar is in the wrong spot on the number line? We get an interval (or a test statistic) that is completely wrong! Garbage In, Garbage Out, and the VERY first check we have to do is a check for random sampling so we have confidence in the point estimator!

What happens if we violate this condition? Automatic failure of the interval or test. This condition is essential and vital to the process of doing statistics. Without an unbiased estimator, we have garbage going in.

2. Next up in the formulas we have the wonderful z-star and t-star. So what do we have to check to make sure those values are appropriate to use? How do we do that? What needs to be in place to make sure the z-star and t-star are appropriate?

Well, if the sample we collected is unimodal and symmetric, then we should be very comfortable using the z or t value for the appropriate interval or test. How can we be assured the sample is normal? Well, if the sample is composed of proportions, making sure we have ENOUGH in our sample will make sure it is unimodal and symmetric (even normal!). How many is enough? Some books say np>5, some books say np>10, but the simplest idea is we are checking to make sure the sample is normal or big enough.

If our sample is composed of numerical quantities, then it is even easier. Graph the sample, look at it. Is it unimodal and symmetric? Good enough.

What happens if we violate this condition? Through several Fathom sampling exercises, we discovered that if this condition is violated the z-star or t-star we use OVERESTIMATES our confidence. We are claiming we are 95% confident, but in reality we are only 92% or 91%. That is bad. We are lying to people if we violate this condition. Not good. If we don’t check this, it is a failure.
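The same effect can be seen without Fathom. For a proportion, the chance that the "95%" z interval actually captures the true p can be computed exactly from the binomial distribution. This sketch is a swapped-in exact calculation, not the Fathom exercise from the institute, and the two (n, p) pairs are made-up examples: one where the large counts condition fails badly and one where it is easily met.

```python
from math import comb, sqrt
from statistics import NormalDist

Z_STAR = NormalDist().inv_cdf(0.975)  # about 1.96 for 95% confidence


def z_interval_coverage(n: int, p: float) -> float:
    """Exact probability that p_hat +/- z* SE captures the true p."""
    covered = 0.0
    for k in range(n + 1):
        p_hat = k / n
        margin = Z_STAR * sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p <= p_hat + margin:
            covered += comb(n, k) * p**k * (1 - p) ** (n - k)
    return covered


# Hypothetical cases: np = 1 (condition violated) vs. np = 100 (condition met).
print(round(z_interval_coverage(20, 0.05), 3))
print(round(z_interval_coverage(200, 0.50), 3))
```

When the counts are too small, the true capture rate falls well short of the 95% we are claiming. When the counts are large, it sits close to 95%.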

3. Independence ….. Independence is tricky. For most of the problems we do, we find that n < 10% is enough to check. Does that guarantee that every person answered independently? What about experiments where independence is much, much more difficult to ensure? What Josh and the editors of the book he helped author are saying is that n < 10% is enough for the single sample cases.

Why? Because we cannot guarantee independence, but we can try to make sure it is there. But in the end, we don’t really care about the sample size as long as it is done in a way to ensure independence.

Again, why? Because of what happens if we violate the sample size condition. Um, nothing. We can adjust for it if necessary, but the REALLY cool thing is that if we violate the sample size condition all we are doing is UNDERESTIMATING our confidence.

If we say that we are 95% confident and we have a sample size that is too big, what happens in reality is that we are 97% or 99% confident. We are lying, but we are lying to the GOOD side.
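A quick simulation shows the "lying to the good side" effect. This is only a sketch: the population of 1,000 values is made up, the interval uses 1.96 with the sample standard deviation and no correction, and the sample sizes (5% and 50% of the population) are chosen just to make the contrast obvious.

```python
import random
from math import sqrt

rng = random.Random(2012)  # fixed seed so the run is repeatable

# Hypothetical finite population of 1,000 values.
population = [rng.gauss(50, 10) for _ in range(1000)]
mu = sum(population) / len(population)


def covers(sample, z_star=1.96):
    """Does x_bar +/- z* s / sqrt(n) capture the true population mean?"""
    n = len(sample)
    x_bar = sum(sample) / n
    s = sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
    return abs(x_bar - mu) <= z_star * s / sqrt(n)


def coverage(n, reps=1000):
    """Fraction of intervals that capture mu, sampling WITHOUT replacement."""
    return sum(covers(rng.sample(population, n)) for _ in range(reps)) / reps


small = coverage(50)    # n is 5% of the population
large = coverage(500)   # n is 50% of the population
print(small, large)
```

The interval built from the oversized sample captures the true mean more often than the 95% we would claim, so the claim is conservative rather than misleading.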

Which is why the AP exam grading rubric does not penalize learners for forgetting to check the third condition!

So to simplify and make it easy to understand, we have the following case:

The colors match up to the why of our conditions. Condition 3 is really about independence, not just n < 10%, but independence is very difficult to check and ends up being optional.

Besides, if we really want to, we can just do this:

And work in the Finite Population Correction Factor. This is not part of the AP curriculum, but it is nice to know WHY we check these, and WHAT happens when we fail to check them.

I know I will do a much better job teaching and making this understandable for learners knowing this.

A great question for understanding why we check conditions in stats is, “What happens when we have a sample size greater than 10% of the population?” One of the themes of Tabor’s institute was what happens when we violate the conditions, and on day one we asked this question. Another way to phrase this question is, “Does the size of the population we are sampling from matter?”

To explore this question we started off with the Federalist paper exercise (a first week exercise in his class). This is very similar to the Gettysburg Address exercise focused on the Central Limit Theorem, where we are sampling from a population of words. The key to checking the condition is that the population we are sampling from is of a limited size.

In this case, we have a population size of 130 words, and we are sampling different sizes of samples.

| Sample size (n) | σ/√n | SE of xbar of simulation samples |
|---|---|---|
| 5 | 1.296 | 1.287 |
| 20 | 0.648 | 0.594 |
| 100 | 0.290 | 0.150 |
| 129 | 0.255 | 0.023 |
| 130 | 0.254 | 0 |

At a sample size of 5, σ/√n is a good approximation of the simulation standard error we calculated. But notice, as we increase the sample size, the standard formula for the Central Limit Theorem breaks down. The difference between the two values grows wider and wider as the sample size increases. Clearly the CLT formula breaks down at some point and is no longer a good estimator of the standard deviation.

What we need to do is adjust the formula for the fact the sample size (n) is approaching the population size, or N. This adjustment is called the Finite Population Correction Factor and is $\sqrt{\frac{N-n}{N-1}}$, which is very nearly $\sqrt{1 - n/N}$.

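| Sample size (n) | σ/√n | SE of xbar of simulation samples | σ/√n with the correction factor |
|---|---|---|---|
| 5 | 1.296 | 1.287 | 1.272 |
| 20 | 0.648 | 0.594 | 0.596 |
| 100 | 0.290 | 0.150 | 0.139 |
| 129 | 0.255 | 0.023 | 0.022 |
| 130 | 0.254 | 0 | 0 |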
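The comparison in the table can be rerun without Fathom. This sketch does not use the actual Federalist paper word lengths (they are not in this post); it builds a made-up population of 130 "word lengths" instead, so the numbers will differ from the table, but the pattern is the same. It uses the N - 1 form of the correction together with the population standard deviation.

```python
import random
from math import sqrt
from statistics import pstdev

rng = random.Random(130)  # fixed seed so the run is repeatable

# Hypothetical stand-in for the 130 word lengths used in class.
population = [rng.randint(1, 12) for _ in range(130)]
N = len(population)
sigma = pstdev(population)  # population SD (divides by N)


def simulated_se(n, reps=2000):
    """SD of many sample means, sampling WITHOUT replacement."""
    means = [sum(rng.sample(population, n)) / n for _ in range(reps)]
    return pstdev(means)


def clt_se(n):
    """The usual sigma / sqrt(n)."""
    return sigma / sqrt(n)


def corrected_se(n):
    """sigma / sqrt(n) times the finite population correction factor."""
    return clt_se(n) * sqrt((N - n) / (N - 1))


for n in (5, 20, 100, 129, 130):
    print(n, round(clt_se(n), 3), round(simulated_se(n), 3), round(corrected_se(n), 3))
```

When n equals N there is only one possible sample, so every sample mean is identical and the simulated SE is 0, which the uncorrected formula can never produce.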

Wow, notice now that the simulation approaches the corrected value! That is wonderful. But why the maximum sample size condition of n < 10% of the population? Let’s graph the correction factor on the domain from 0 to 1 (since n/N will approach 1 as n approaches N) to see if that gains us any insights.

So why do we say that n < 10% of the population? Because between 0 and 10% there is only a 5% drop in the adjusted standard deviation, but between 0 and 20% there is an 11% drop (approximately). The curve drops off at a faster rate from there.
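Those two percentages are easy to check by hand or with a few lines of code. The sketch below uses the large-population form of the correction, the square root of (1 - n/N), which is an approximation of the exact factor; the function name `fpc` is just for illustration.

```python
from math import sqrt


def fpc(fraction_sampled: float) -> float:
    """Approximate finite population correction when n/N = fraction_sampled."""
    return sqrt(1 - fraction_sampled)


for fraction in (0.05, 0.10, 0.20, 0.50, 0.90):
    drop = 1 - fpc(fraction)  # how much the standard deviation shrinks
    print(f"n/N = {fraction:.0%}: standard deviation drops by {drop:.1%}")
```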

Why n < 10%? So we don’t have to worry about adjusting more than dividing by the root(n) and don’t have to worry about the Finite Population Correction Factor! I was blown away by this explanation. It solidifies so much about the reason why we check conditions for me.

But, and this is a very important but, there is more to come on the checking of conditions. Is this condition an important one? Not so much it turns out, and the reasons why are so informative to teaching and understanding stats.

The Fathom File used to generate the simulation of samples:

I spent this past week at the Silver State AP Conference with Josh Tabor as the instructor. First, let me just get this out of the way. If you ever have the chance to spend some time with him, do it. Do not pass Go, do not collect \$200, just go directly to the event. His knowledge of statistics and the pedagogy of teaching statistics is amazing. I will have several posts coming in the next day or so. Some of what I am posting will be pedagogy, some will be content, but all will be useful to the AP stats instructor (namely, me.)

I want to begin with some basic questions and formulations for asking questions for the entire course. I think one of the best things Josh did pedagogically was asking the same two questions over and over again.

1. Is there evidence for <blank>?

2. Is there CONVINCING evidence for <blank>?

He started every course of discovery with these two questions. We would look at an AP question, or any question, and this is how it would start. Do we have evidence? Do we have convincing evidence? The next step, however, is evidence for what?

Let me lay out a problem for us to look at. I take a deck of cards and tilt it heavily or completely towards red cards. Make it look like it is a brand new deck, however, so the learners don’t initially think you are cheating. Then, tell them if they pull a black card out of the deck they will get a candy bar, or extra credit, or something.

The first person pulls, and of course, they get red. Aw shucks, no big deal though. Person 2, person 3, etc. At some point the class is going to accuse you of cheating. Of course they are, you ARE cheating, after all.

So what are our options that could be taking place?

1. The deck of cards is a fair deck and the learners are unlucky.

2. The deck of cards is an unfair deck and Mr. Waddell is cheating.

These are the two options we have, and towards the end of the year we will recognize these as Null Hypothesis and Alternative Hypothesis statements. At the beginning of the year (heck, the first day of class!), however, these are easy and accessible statements to write down.

Next, I ask the class, do we have evidence for one of these statements? Yes, we clearly do have evidence that Mr. Waddell is cheating. 5 learners in a row got red cards.

Do we have CONVINCING evidence for one of these statements? Now, in the first week of class, we can have a discussion of what convincing means without getting into discussions of alpha or significance. We can think statistically without the math.
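The learners do not need the math on day one, but it helps the instructor to know how unlucky five reds in a row would be with a fair deck. Here is a sketch, assuming a standard 52-card deck with 26 red cards and assuming each drawn card is kept out of the deck (the story above does not say whether cards go back in).

```python
from fractions import Fraction


def prob_all_red(draws: int, red: int = 26, total: int = 52) -> Fraction:
    """Chance that every one of `draws` cards is red, drawing without replacement."""
    prob = Fraction(1)
    for i in range(draws):
        prob *= Fraction(red - i, total - i)
    return prob


five_reds = prob_all_red(5)
print(five_reds, float(five_reds))
```

Under those assumptions a fair deck gives five reds in a row only about 2.5% of the time, which is why the class starts calling it cheating around the fifth pull.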

This line of questioning is repeated all year long on every question.

1. What are our two options?

2. Do we have convincing evidence for one of these two options?

And so begins the adventure and journey called AP Statistics. I will show more of this structure on questions to come.