Chapter 14 Distribution of the Sample Mean
14.1 The Sample Mean as a Random Variable
Interactive Example: Flipping a Coin
In this example, we’ll simulate flipping a coin \(N\) times and calculating the average \({\bar{Y}}_{N}\). We are interested in the distribution of \(\bar{Y}_{N}\) under repeated sampling.
We’ll represent any single flip of the coin as a random variable \(Y\) that can take on values Heads (\(Y=1\)) or Tails (\(Y=0\)). As we’ve seen before, \(Y \sim \mbox{Bernoulli}(\pi)\), where a fair coin has \(\pi =.5\).
For a single sample, our virtual coin is flipped \(N\) times. The default is \(N=5\). A row in the table in the main panel represents a single sample of \(N\) flips of the coin. The observed outcomes for the coin flips are shown as \(Y1\) through \(YN\). The sample mean \(\bar{Y}_{N}=\frac{1}{N}\sum_{i=1}^N Y_i\) is calculated and shown in the last column.
Again, we are assuming a world of repeated sampling, so we want to generate multiple samples of \(N\) coin flips each. In the default case, 10 samples are generated. You can think of this as 10 different people flipping the coin \(N\) times, recording their flips of the coin in each row, and calculating their sample means. The table in the main panel shows the results for each sample that is generated, including each sample’s (i.e., each row’s) mean. Take a second to look at the table below and make sure you understand what each row represents.
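The table of repeated samples can be mimicked outside the app. The following is a minimal Python sketch, not the app's actual code: the function name `simulate_samples`, the seed, and the use of Python's `random` module are all assumptions made for illustration. Each row holds one sample of \(N=5\) flips and that sample's mean.

```python
import random

def simulate_samples(n_flips=5, n_samples=10, pi=0.5, seed=14):
    """Return one (flips, sample_mean) row per simulated sample.

    Hypothetical helper: each flip is 1 (Heads) with probability pi, else 0 (Tails).
    """
    rng = random.Random(seed)  # fixed seed so the table is reproducible
    rows = []
    for _ in range(n_samples):
        flips = [1 if rng.random() < pi else 0 for _ in range(n_flips)]
        rows.append((flips, sum(flips) / n_flips))  # sample mean = (1/N) * sum of flips
    return rows

rows = simulate_samples()
for flips, mean in rows:
    print(flips, mean)
```

Changing the seed plays the role of a different set of 10 people flipping the coin: the rows change, and so do the sample means.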
There are a few parameters you can change in the left panel. As you increase the sample size \(N\), each row will contain more flips of our virtual coin. Notice that as you do so, the possible values for the sample means change. With \(N\) flips, \(\bar{Y}_N\) can take any of the \(N+1\) values \(0, 1/N, 2/N, \ldots, 1\), so as you increase \(N\), \(\bar{Y}_N\) has a larger number of possible values. For a relatively small number of samples (e.g., 10), you will only see a portion of those possible values. To see more, increase the number of samples generated (last slider in the left panel).
If you click on the Plot tab, you will see a plot of the proportion of times each \(\bar{Y}_N\) value appears in the set of simulated samples. Increase the number of samples generated to 100,000 to see a better approximation of the true sampling distribution for \(\bar{Y}_N\). While in the Plot tab and with the number of samples generated set to 100,000, change the sample size \(N\) to see how the sampling distribution of \(\bar{Y}_N\) changes as the sample size \(N\) changes.
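To see why 100,000 samples gives a good approximation, the simulated proportions can be compared with the exact sampling distribution. The sketch below is an illustration under stated assumptions, not the app's code: it relies on the fact that the number of Heads in \(N\) independent flips is Binomial, so \(P(\bar{Y}_N = k/N) = \binom{N}{k}\pi^k(1-\pi)^{N-k}\). The function names and seed are hypothetical.

```python
import random
from collections import Counter
from math import comb

def empirical_distribution(n_flips, n_samples, pi=0.5, seed=14):
    """Proportion of simulated samples in which each sample-mean value appears."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        heads = sum(rng.random() < pi for _ in range(n_flips))
        counts[heads] += 1
    # A Counter returns 0 for head counts that never occurred.
    return {k / n_flips: counts[k] / n_samples for k in range(n_flips + 1)}

def exact_distribution(n_flips, pi=0.5):
    """Exact probabilities of each sample-mean value, from the Binomial formula."""
    return {
        k / n_flips: comb(n_flips, k) * pi**k * (1 - pi) ** (n_flips - k)
        for k in range(n_flips + 1)
    }

approx = empirical_distribution(5, 100_000)
exact = exact_distribution(5)
for value in exact:
    print(f"{value:.1f}  simulated={approx[value]:.4f}  exact={exact[value]:.4f}")
```

With fewer simulated samples (say, 10), the simulated column will typically sit much farther from the exact column.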
14.2 Larger (Random) Samples are Better
Interactive Example: Sampling Distributions
In this example, we’ll compare the distribution \(f(y)\) of a random variable \(Y\) to the distribution \(f(\bar{y}_N)\) of the sample mean \({\bar{Y}}_{N}\). Various distributions are available in the drop-down in the left panel. The main panel will show the PDF \(f(y)\) for that distribution, the expected value \(E(Y)\), and a plot of \(f(y)\) on the right. For most of the distributions listed in the drop-down, you can also change a parameter (or two) in the left panel.
In the lower part of the left panel, there are sliders for the sample size \(N\) and for the number of samples generated. The sample size \(N\) determines how many values are drawn from \(f(y)\) in order to calculate a single sample mean \({\bar{Y}}_{N}\). The number of samples generated slider allows for values of 1,000, 100,000 (default), and 1,000,000. This determines the number of samples (of size \(N\)) to simulate and, therefore, the number of times \({\bar{Y}}_{N}\) is calculated.
A plot of the distribution of the simulated values of \({\bar{Y}}_{N}\) is shown on the right. This distribution is denoted as \(\tilde{f}(\bar{y}_N)\) because it is an approximate distribution based on the simulated samples. With 100,000 samples, the approximation is usually fairly good. For a better approximation, increase the number of samples generated to 1,000,000. Note, however, that the app will take quite a bit longer to update when you are generating 1,000,000 samples each time.
One important part of this example is to watch what happens to the sampling distribution \(\tilde{f}(\bar{y}_N)\) as you change \(N\). First, choose a probability distribution from the drop-down. Leave the default parameters as they are, but set \(N=1\). This represents sampling from \(f(y)\) one value at a time. The graph of \(\tilde{f}(\bar{y}_N)\) should look similar to that of \(f(y)\). Now increase the sample size to \(N=2\). How does the plot of \(\tilde{f}(\bar{y}_N)\) change? Increase the sample size by one step, and then another, and watch how the distribution of \(\bar{Y}_N\) changes. As you increase the sample size \(N\), three things should happen: (1) the distribution should become increasingly centered over \(Y\)’s expected value (the blue line), (2) the variance should shrink, and (3) once the sample size is around \(N=50\) or 100, the distribution of \(\bar{Y}_N\) should look increasingly like a Normal distribution.
That’s one very important part of this example. Notice, however, that I didn’t mention which distribution you should choose in the drop-down. Now go through the same steps for another distribution. Then try another. Try one more after that. The second important point of this example is that, regardless of which distribution we choose for the underlying random variable \(Y\), the distribution \(\tilde{f}(\bar{y}_N)\) of the sample mean \(\bar{Y}_N\) will behave in the same manner as just described. For any of the drop-down distributions (Bernoulli, Uniform, Normal, Exponential, Poisson, or Weird), as you increase the sample size \(N\), three things should happen: (1) the distribution should become increasingly centered over \(Y\)’s expected value (the blue line), (2) the variance should shrink, and (3) once the sample size is around \(N=50\) or 100, the distribution of \(\bar{Y}_N\) should look increasingly like a Normal distribution.
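The same three behaviors can be checked numerically. The sketch below is a rough stand-in for the app, not its implementation: it assumes an Exponential distribution with rate 1 (so \(E(Y)=1\) and \(V(Y)=1\)), uses 10,000 simulated samples to keep the run time short, and reports skewness as a crude proxy for how Normal the shape looks, since a Normal distribution has skewness 0. All function names are hypothetical.

```python
import random

def simulate_sample_means(n, n_samples, rng):
    """Draw n_samples samples of size n from Exponential(rate=1); return their means."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(n_samples)]

def summarize(values):
    """Mean, variance, and skewness of a list of simulated sample means."""
    k = len(values)
    mean = sum(values) / k
    var = sum((v - mean) ** 2 for v in values) / k
    skew = sum((v - mean) ** 3 for v in values) / k / var**1.5
    return mean, var, skew

rng = random.Random(14)
summary = {}
for n in (1, 2, 10, 50):
    summary[n] = summarize(simulate_sample_means(n, 10_000, rng))
    mean, var, skew = summary[n]
    print(f"N={n:>2}  mean={mean:.3f}  variance={var:.4f}  skewness={skew:.2f}")
```

As \(N\) grows, the mean column should stay near 1, the variance column should fall roughly like \(1/N\), and the skewness column should drift toward 0.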
Interactive Example: The Law of Large Numbers and the Central Limit Theorem
In this example, we’ll visualize the LLN and the CLT at work. This example is longer than the others and contains multiple parts. I recommend reading a paragraph or two and then jumping to the app and plots to make sure you understand what’s being displayed. Then come back and move to the next paragraph or two.
As in the previous interactive example, the left panel allows you to choose from a handful of probability distributions and change their parameters. The distribution \(f(y)\), expected value \(E(Y)\), and variance \(V(Y)\) are displayed in the main panel, along with a plot of \(f(y)\). The blue line in the upper plot shows the value of \(E(Y)\).
The lower section of the main panel shows numerous sample means \(\bar{Y}_N\) plotted as a function of the sample size \(N\). For each point, a sample of size \(N\) is drawn from the distribution \(f(y)\) and the sample mean \(\bar{Y}_N\) is then calculated. Darker gray or black areas indicate more means at or near that value. In the default (i.e., initial) case, \(Y \sim \mbox{Bernoulli}(.5)\), the maximum sample size is \(N=200\), 10,000 random samples of varying sizes (\(N=1\) to \(N=200\)) are generated, the mean is calculated for each sample, and the values of \(\bar{Y}_N\) are plotted as a function of \(N\).
We want to examine how the distribution of the sample average \(\bar{Y}_N\) changes as \(N\) increases. Imagine taking a vertical slice through the plot, say at \(N=100\). That slice represents the distribution of \(\bar{Y}_{100}\) values, i.e., sample means when we have samples of size \(N=100\). That slice of \(\bar{Y}_{100}\) values will have an expected value \(E(\bar{Y}_{100})\) and a variance \(V(\bar{Y}_{100})\).
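As a worked instance of that slice, take the default case \(Y \sim \mbox{Bernoulli}(.5)\). A Bernoulli variable has \(E(Y)=\pi\) and \(V(Y)=\pi(1-\pi)\), and for independent draws the sample mean has \(E(\bar{Y}_N)=E(Y)\) and \(V(\bar{Y}_N)=V(Y)/N\), so \[\begin{align*} E(\bar{Y}_{100}) &= .5 \\ V(\bar{Y}_{100}) &= \frac{.5(1-.5)}{100} = .0025 \\ \sqrt{V(\bar{Y}_{100})} &= .05 \end{align*}\] In other words, the dots in the slice at \(N=100\) should be centered on .5 with a standard deviation of about .05.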
Law of Large Numbers
Let’s compare the far left side of the plot (samples of \(N=1\) or \(N=2\)) to the far right of the plot (samples of \(N=199\) or \(N=200\)). Notice how spread out the sample averages are when we have small samples. As the sample size \(N\) increases (i.e., as we move to the right in the plot), two things happen: (1) the spread (or variance) of the averages decreases and (2) the distribution is increasingly concentrated around the blue line, the expected value \(E(Y)\) of the underlying distribution \(f(y)\). This is the Law of Large Numbers in action.
The last two sliders in the side panel allow you to change (1) the maximum sample size \(N\) shown in the lower plot and (2) the number of samples (i.e., dots) shown in the plot. Keep the number of samples generated at 10,000, but change the maximum sample size to \(1000\). You should see a trend similar to the one before: as we increase the sample size, the sample means are more and more tightly concentrated around \(E(Y)\). In theory, we could continue doing this for even larger samples. The LLN tells us that as we increase the sample size \(N\), the sample mean converges to the expected value of \(Y\); that is, the probability that \(\bar{Y}_N\) lies farther than any fixed distance from \(E(Y)\) shrinks toward zero.
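One way to put a number on this concentration is to pick a small distance from \(E(Y)\) and measure the share of sample means that land within it. The sketch below is only an illustration: it assumes Uniform(0, 1) draws (so \(E(Y)=.5\)), a distance of .05, and 2,000 simulated samples per sample size, none of which come from the app. The function name is hypothetical.

```python
import random

def share_within(n, n_samples, eps=0.05, seed=14):
    """Share of simulated sample means that fall within eps of E(Y) = 0.5.

    Assumes Y ~ Uniform(0, 1); each sample mean averages n independent draws.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) <= eps:
            hits += 1
    return hits / n_samples

shares = {n: share_within(n, 2_000) for n in (10, 100, 1000)}
for n, share in shares.items():
    print(f"N={n:>4}  share within .05 of E(Y): {share:.3f}")
```

The share should climb toward 1 as \(N\) increases, which is the LLN statement expressed as a proportion of dots.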
Try changing the distribution selection in the drop-down menu, the maximum sample size, and the number of samples generated. In each case, you should see the same behavior in the distribution (or slice) of \(\bar{Y}_N\) as \(N\) increases.
Central Limit Theorem
Before you use this example to explore the CLT, note that you can use the previous interactive example to do so as well — and perhaps more intuitively. In the previous interactive example, if you pick any distribution in the drop-down and set the sample size to anything above \(N=50\), you should see a sampling distribution \(\tilde{f}(\bar{y}_N)\) that is approximately Normal. Take a few minutes to explore the previous interactive example until you feel like you understand how the CLT is being demonstrated there.
Once you’ve done that, we can see the CLT at work in this example. Pick any distribution from the drop-down menu. Choose any values for the distribution parameters. One caution: if you choose the Bernoulli, set \(\pi\) to a value between .2 and .8, since for \(\pi\) close to 0 or 1 the distribution of \(\bar{Y}_N\) can still be noticeably skewed at moderate sample sizes. Finally, set the maximum sample size to \(N=200\).
Again, think of taking a slice through the main panel lower scatterplot at a particular value of \(N\). The dots for that slice are the sample means corresponding to samples of size \(N\). As we’ve seen, the sample mean \(\bar{Y}_N\) for a given \(N\) has a distribution. In the scatterplot, for a given slice of \(N\), imagine the density \(f(\bar{y}_N)\) coming out of the page (or monitor) and that we’re looking down on the density.
The CLT tells us that as \(N\) increases, the distribution of the sample mean is increasingly well approximated by \[\bar{Y}_N \sim \mbox{Normal}\left[ E(Y), \, \frac{V(Y)}{N}\right]\] We’ll use \(N=50\) as a cutoff for when the CLT seems to have kicked in or to be a “good enough” approximation. When \(N>50\), the CLT implies
1. that sample means should be centered around \(E(Y)\),
2. that \(V(\bar{Y}_N) = V(Y)/N\) should decrease as the sample size \(N\) increases, and
3. that \(\bar{Y}_N\) is Normally distributed.
The first two are evident in the scatterplot in the main panel. Try changing the distribution. You should still see the same behavior.
To demonstrate the third point (that for larger samples, \(\bar{Y}_N\) is specifically Normally distributed), notice the red lines for \(N>50\). The red lines are calculated as \[\begin{align*} \mbox{Upper} &= E(Y) + 2.576 \sqrt{V(Y)/N} \\ \mbox{Lower} &= E(Y) - 2.576 \sqrt{V(Y)/N} \end{align*}\] The value 2.576 comes specifically from the Normal distribution: 99% of the probability under a Normal density lies within 2.576 standard deviations of its mean, and \(\sqrt{V(Y)/N}\) is the standard deviation of \(\bar{Y}_N\). According to the CLT, for \(N>50\), about 99% of the values of \(\bar{Y}_N\) (i.e., dots) should fall between the upper and lower red lines. Try changing the maximum sample size and the distribution. Typically, only about 1% of the dots with \(N>50\) should fall outside the red bounds.
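The 99% claim can be checked by counting dots in a single slice. The following sketch is an assumption-laden illustration rather than the app's code: it fixes \(Y \sim \mbox{Bernoulli}(.5)\) and \(N=100\), simulates 5,000 sample means, and counts how many fall between the two bounds defined above. The function name and seed are hypothetical.

```python
import random
from math import sqrt

def band_coverage(n, n_samples, pi=0.5, seed=14):
    """Share of simulated Bernoulli sample means inside E(Y) +/- 2.576 * sqrt(V(Y)/N)."""
    rng = random.Random(seed)
    e_y = pi              # E(Y) for a Bernoulli(pi) variable
    v_y = pi * (1 - pi)   # V(Y) for a Bernoulli(pi) variable
    half_width = 2.576 * sqrt(v_y / n)
    inside = 0
    for _ in range(n_samples):
        mean = sum(rng.random() < pi for _ in range(n)) / n
        if abs(mean - e_y) <= half_width:
            inside += 1
    return inside / n_samples

coverage = band_coverage(n=100, n_samples=5_000)
print(f"share of sample means inside the band: {coverage:.3f}")
```

Because Bernoulli sample means are discrete and the Normal distribution is only an approximation, the share should come out close to, but not exactly, .99.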
Summary
Practice Session: Distribution of the Sample Mean
This practice session is fairly straightforward. You are told that you have \(N\) observations from a random variable \(Y\). You are also told either the distribution of \(Y\) or its expected value and variance. Based on that, you are asked a question about the distribution of the sample mean \(\bar{Y}_N\). Use the Central Limit Theorem (CLT) and the information given about \(Y\) to answer the question. Try at least ten problems, or until you feel comfortable answering the questions.
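For instance, suppose a (hypothetical) problem states that you have \(N=100\) observations of \(Y \sim \mbox{Bernoulli}(.3)\). Then \(E(Y)=.3\) and \(V(Y)=.3(1-.3)=.21\), and the CLT gives, approximately, \[\bar{Y}_{100} \sim \mbox{Normal}\left[ .3, \, \frac{.21}{100}\right] = \mbox{Normal}\left[ .3, \, .0021\right]\] Note that the second argument is a variance; the standard deviation of \(\bar{Y}_{100}\) is \(\sqrt{.0021} \approx .046\).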
Practice Session: Probability Calculations using the Sample Mean
This practice session is similar to the previous one. However, here, you are asked to take an additional step. As before, you are told that you have \(N\) observations from a random variable \(Y\). You are also told either the distribution of \(Y\) or its expected value and variance. You are then asked to calculate the probability of observing a sample mean \(\bar{Y}_N\) that is less than some value, greater than some value, or falls within a specific interval.
As you did in the previous practice session, you will need to use the Central Limit Theorem (CLT) and the information given about \(Y\) to determine the distribution of \(\bar{Y}_N\). Once you’ve done that, use R’s pnorm() command to calculate the requested probability. Try at least ten problems, or until you feel comfortable answering the questions.
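The two-step calculation can be sketched as follows. The practice session asks for R's pnorm(); this sketch instead uses Python's standard-library `statistics.NormalDist`, whose `cdf` method plays the same role. The problem values (\(E(Y)=10\), \(V(Y)=4\), \(N=64\)) and the helper name are made up for illustration. As with pnorm(), the Normal distribution here is specified by its standard deviation, so the variance \(V(Y)/N\) must be square-rooted first.

```python
from math import sqrt
from statistics import NormalDist

def sample_mean_distribution(e_y, v_y, n):
    """CLT approximation for the sample mean: Normal with mean E(Y) and variance V(Y)/N.

    NormalDist is parameterized by the standard deviation, hence the square root.
    """
    return NormalDist(mu=e_y, sigma=sqrt(v_y / n))

# Hypothetical practice problem: E(Y) = 10, V(Y) = 4, N = 64, so sd of the mean is 0.25.
ybar = sample_mean_distribution(e_y=10, v_y=4, n=64)

p_below = ybar.cdf(10.5)                     # P(sample mean < 10.5)
p_above = 1 - ybar.cdf(10.5)                 # P(sample mean > 10.5)
p_between = ybar.cdf(10.5) - ybar.cdf(9.5)   # P(9.5 < sample mean < 10.5)

print(round(p_below, 4), round(p_above, 4), round(p_between, 4))
```

Here 10.5 is two standard deviations above the mean, so the three probabilities are about .977, .023, and .954. The corresponding R call for the first one would be pnorm(10.5, mean = 10, sd = 0.25).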