A great question for understanding why we check conditions in stats: “What happens when we have a sample size greater than 10% of the population?” One of the themes of Tabor’s institute was what happens when we violate the conditions, and on day one we asked this question. Another way to phrase it is, “Does the size of the population we are sampling from matter?”

To explore this question we started off with the Federalist paper exercise (a first-week exercise in his class). This is very similar to the Gettysburg Address exercise focused on the Central Limit Theorem, where we are sampling from a population of words. The key detail for this condition check is that the population we are sampling from is of limited size.

In this case, we have a population size of 130 words, and we are sampling different sizes of samples.

| Sample size (n) | SE of x̄ (σ/√n) | SE of simulation samples |
|---|---|---|
| 5 | 1.296 | 1.287 |
| 20 | 0.648 | 0.594 |
| 100 | 0.290 | 0.150 |
| 129 | 0.255 | 0.023 |
| 130 | 0.254 | 0 |

At a sample size of 5, σ/√n is a good approximation of the simulation standard error we calculated. But notice that as we increase the sample size, the standard formula from the Central Limit Theorem breaks down: the difference between the two values grows wider and wider. Clearly the formula stops being a good estimator of the standard deviation of x̄ at some point.
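A quick simulation makes the breakdown easy to reproduce. This is only a sketch with a made-up population: the actual Federalist-paper word lengths aren’t reproduced here, so the numbers won’t match the table exactly, but the pattern (formula fine at small n, badly wrong as n approaches N) is the same.

```python
import random
import statistics

# Hypothetical stand-in population of 130 "word lengths" (the real
# Federalist-paper values are not reproduced here).
random.seed(1)
population = [random.randint(1, 11) for _ in range(130)]
sigma = statistics.pstdev(population)  # population standard deviation

def simulated_se(n, trials=5000):
    """SE of x-bar from repeated samples drawn WITHOUT replacement."""
    means = [statistics.fmean(random.sample(population, n)) for _ in range(trials)]
    return statistics.pstdev(means)

for n in (5, 20, 100, 129, 130):
    formula = sigma / n ** 0.5  # the usual CLT-based formula
    print(f"n = {n:3d}: formula = {formula:.3f}, simulation = {simulated_se(n):.3f}")
```

At n = 130 every sample *is* the population, so every sample mean is identical and the simulated SE collapses to 0, while σ/√n is still far from zero.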

What we need to do is adjust the formula for the fact that the sample size (n) is approaching the population size, N. This adjustment is called the Finite Population Correction Factor, √((N − n)/(N − 1)), giving a corrected standard error of (σ/√n) · √((N − n)/(N − 1)).

| Sample size (n) | SE of x̄ (σ/√n) | SE of simulation samples | Corrected SE |
|---|---|---|---|
| 5 | 1.296 | 1.287 | 1.272 |
| 20 | 0.648 | 0.594 | 0.596 |
| 100 | 0.290 | 0.150 | 0.139 |
| 129 | 0.255 | 0.023 | 0.022 |
| 130 | 0.254 | 0 | 0 |
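The corrected column can be recomputed directly. As an assumption, σ is back-solved from the table’s n = 5 row (σ ≈ 1.296 · √5 ≈ 2.898), so the output reproduces the corrected values to within rounding.

```python
import math

N = 130        # population size (Federalist-paper word count)
sigma = 2.898  # assumed: back-solved from sigma/sqrt(5) = 1.296 in the table

def corrected_se(n):
    """SE of x-bar with the Finite Population Correction Factor applied."""
    return (sigma / math.sqrt(n)) * math.sqrt((N - n) / (N - 1))

for n in (5, 20, 100, 129, 130):
    print(f"n = {n:3d}: plain = {sigma / math.sqrt(n):.3f}, corrected = {corrected_se(n):.3f}")
```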

Wow, notice now that the simulation matches the corrected value! That is wonderful. But why the maximum sample size condition of n < 10% of the population? Let’s graph the correction factor against the sampling fraction n/N on the domain from 0 to 1 (since n/N approaches 1 as n approaches N) to see if that gains us any insights.

So why do we say that n < 10% of the population? Because between sampling fractions of 0 and 10% there is only about a 5% drop in the adjusted standard error, but between 0 and 20% there is approximately an 11% drop. The curve drops off at a faster rate from there.

Why n < 10%? So we don’t have to adjust beyond dividing by √n, and don’t have to worry about the Finite Population Correction Factor! I was blown away by this explanation. It solidifies so much for me about why we check conditions.

But, and this is a very important but, there is more to come on the checking of conditions. Is this condition an important one? Not so much, it turns out, and the reasons why are so informative for teaching and understanding stats.

The Fathom File used to generate the simulation of samples: