Jun 29, 2013
 

A great question for understanding why we check conditions in stats is, “What happens when we have a sample size greater than 10% of the population?” One of the themes of Tabor’s institute was what happens when we violate the conditions, and on day one we asked this question. Another way to phrase it is, “Does the size of the population we are sampling from matter?”

To explore this question we started off with the Federalist paper exercise (a first week exercise in his class). This is very similar to the Gettysburg Address exercise focused on the Central Limit Theorem, where we are sampling from a population of words. The key to checking the condition is that the population we are sampling from is of a limited size.

In this case, we have a population of 130 words, and we draw samples of several different sizes.

 

| Sample size (n) | $\sigma/\sqrt{n}$ | SE of $\bar{x}$ from simulation samples |
| --- | --- | --- |
| 5 | 1.296 | 1.287 |
| 20 | 0.648 | 0.594 |
| 100 | 0.290 | 0.150 |
| 129 | 0.255 | 0.023 |
| 130 | 0.254 | 0 |

At a sample size of 5, $\sigma/\sqrt{n}$ is a good approximation of the standard error we calculated from the simulation. But notice that as we increase the sample size, the standard formula we use alongside the Central Limit Theorem breaks down: the gap between the two values grows wider and wider. The formula assumes the draws are independent, which stops being reasonable once a sample drawn without replacement takes up a large share of the population. At some point $\sigma/\sqrt{n}$ is no longer a good estimate of the standard deviation of the sample means.
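The comparison above can be sketched in a few lines of Python. This is not the Fathom file from the institute: the population here is a made-up list of 130 "word lengths" (an assumption, since the actual Federalist data is not reproduced in this post), and `simulated_se` is a hypothetical helper name. The point is only to show the gap between $\sigma/\sqrt{n}$ and the simulated standard error opening up as n approaches N.

```python
import random
import statistics

def simulated_se(population, n, trials=5000, seed=0):
    """SD of the sample means over many samples drawn WITHOUT replacement."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(trials)]
    return statistics.pstdev(means)

# Assumption: a stand-in population of 130 invented word lengths,
# NOT the real Federalist paper data used at the institute.
_rng = random.Random(1)
population = [_rng.randint(1, 12) for _ in range(130)]
sigma = statistics.pstdev(population)  # population standard deviation

for n in (5, 20, 100, 129, 130):
    naive = sigma / n ** 0.5  # the uncorrected formula
    sim = simulated_se(population, n)
    print(f"n={n:3d}  sigma/sqrt(n)={naive:.3f}  simulated SE={sim:.3f}")
```

With n = 130 every sample is the whole population, so every sample mean is identical and the simulated standard error is 0, just as in the table.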

What we need to do is adjust the formula for the fact that the sample size, n, is approaching the population size, N. This adjustment is called the Finite Population Correction Factor and is $\sqrt{1 - n/N}$.

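| Sample size (n) | $\sigma/\sqrt{n}$ | SE of $\bar{x}$ from simulation samples | $\sigma/\sqrt{n} \cdot \sqrt{1 - n/N}$ |
| --- | --- | --- | --- |
| 5 | 1.296 | 1.287 | 1.272 |
| 20 | 0.648 | 0.594 | 0.596 |
| 100 | 0.290 | 0.150 | 0.139 |
| 129 | 0.255 | 0.023 | 0.022 |
| 130 | 0.254 | 0 | 0 |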
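The corrected column can be recomputed with a short sketch. Two assumptions to flag: the value of sigma is back-solved from the first row of the table (1.296 × √5 ≈ 2.898) rather than taken from the original data, and the correction is written as $\sqrt{1 - n/N}$, which is the form that matches the table's numbers. Many textbooks write it as $\sqrt{(N - n)/(N - 1)}$ instead, which is nearly identical for N = 130. `corrected_se` is a hypothetical helper name.

```python
from math import sqrt

N = 130        # population size from the post
sigma = 2.898  # assumption: back-solved from the table, 1.296 * sqrt(5)

def corrected_se(sigma, n, N):
    """sigma/sqrt(n) scaled by the finite population correction sqrt(1 - n/N)."""
    return sigma / sqrt(n) * sqrt(1 - n / N)

for n in (5, 20, 100, 129, 130):
    uncorrected = sigma / sqrt(n)
    print(n, round(uncorrected, 3), round(corrected_se(sigma, n, N), 3))
```

The results agree with the table to within rounding of the back-solved sigma.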

 

Wow, notice that the simulation values now track the corrected values closely! That is wonderful. But why the maximum sample size condition of n < 10% of the population? Let’s graph $\sqrt{1 - n/N}$ on the domain from 0 to 1 (since n/N approaches 1 as n approaches N) to see if that gives us any insight.

[Figure: graph of $\sqrt{1 - x}$, where $x = n/N$, on the domain from 0 to 1]

So why do we say that n < 10% of the population? Because at 10% of the population the adjusted standard deviation has dropped by only about 5%, but at 20% it has dropped by about 11%. The curve drops off at a faster rate from there.
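Those percentages are easy to check. Here is a minimal sketch, assuming the correction factor is $\sqrt{1 - n/N}$ and using a hypothetical helper named `fpc`:

```python
from math import sqrt

def fpc(fraction):
    """Finite population correction sqrt(1 - n/N) for a sampling fraction n/N."""
    return sqrt(1 - fraction)

for fraction in (0.05, 0.10, 0.20, 0.50):
    drop = 1 - fpc(fraction)  # relative shrinkage of the standard deviation
    print(f"n/N = {fraction:.0%}: standard deviation shrinks by {drop:.1%}")
```

At 10% the shrinkage works out to about 5.1%, and at 20% to about 10.6%, which is where the rounded 5% and 11% come from.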

Why n < 10%? So that dividing by $\sqrt{n}$ is all the adjusting we need to do, and we don’t have to worry about the Finite Population Correction Factor! I was blown away by this explanation. For me, it solidifies so much about why we check conditions.

But, and this is a very important but, there is more to come on the checking of conditions. Is this condition an important one? Not so much, it turns out, and the reasons why are so informative for teaching and understanding stats.

The Fathom File used to generate the simulation of samples:
