A note on the CLT

Introductory statistics courses (particularly when they’re taught outside of the statistics department) often gloss over the details of the central limit theorem, describing it only as something that lets you do t-tests without worrying about normality. I recently came across a very good example in which the central limit theorem doesn’t behave as you might naively expect it to, and which shows why you need to exercise a little bit of thought before invoking it as justification for whatever normality assumption you’re making in your analysis.

The central limit theorem (there are several versions; this is the most common one) says that if \{X_i\} is a random sample of size n from a distribution with mean \mu and finite variance \sigma^2, then the sample mean \bar{X} satisfies

    \[                \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathrm{N}(0,1)         \]

In other words, for large n, the distribution of the sample mean is approximately normal with mean \mu and variance \sigma^2/n. This is the basis behind, say, standard null-hypothesis testing: if our sample of size n comes from a distribution with mean 0, then our sample mean should come from a distribution that looks roughly like \mathrm{N}(0,\sigma^2/n). If the observed sample mean is very unlikely to have been generated from this distribution, we “reject the null hypothesis” that the true mean is zero.
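To make the testing logic concrete, here is a minimal sketch of that calculation in Python. It assumes \sigma is known (in practice it is estimated, which is where the t-test comes in), and the function name z_test_mean_zero is an illustrative choice, not something from a particular library.

```python
import math

def z_test_mean_zero(sample, sigma):
    """Two-sided z-test of H0: true mean = 0, assuming sigma is known."""
    n = len(sample)
    xbar = sum(sample) / n
    # Standardize the sample mean using the CLT approximation N(0, sigma^2 / n).
    z = xbar / (sigma / math.sqrt(n))
    # Two-sided tail probability under N(0, 1): P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

if __name__ == "__main__":
    z, p = z_test_mean_zero([0.3, -0.1, 0.8, 0.5, 0.2, 0.9], sigma=1.0)
    print(f"z = {z:.3f}, p = {p:.3f}")
```

The p-value is only as trustworthy as the normal approximation behind it, which is the point of the rest of this note.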

The CLT doesn’t say that \bar{X} is normally distributed, though, even for large sample sizes. It says that its distribution becomes closer to normal as n increases. How fast it becomes “close” depends on the distribution from which the sample was drawn. People tend to assume that n = 1000 is big just because it takes a long time to count to, but 1000 is just as far from \infty as 10 is, so there’s no real reason to think of it as “large” from the perspective of the CLT.
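One standard way to quantify “how fast” is the Berry–Esseen theorem, included here as a supplementary sketch. If, in addition to the assumptions above, the third absolute moment \rho = E|X - \mu|^3 of a single draw X is finite, then for every n

    \[ \sup_x \left| P\left( \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le x \right) - \Phi(x) \right| \le \frac{C\rho}{\sigma^3\sqrt{n}} \]

where \Phi is the standard normal CDF and C is an absolute constant (known to be less than 0.5). The bound shrinks only like 1/\sqrt{n}, and it grows with \rho/\sigma^3, so heavier tails mean a slower guaranteed approach to normality. When the third absolute moment is infinite, as it is for any t-distribution with 3 or fewer degrees of freedom, the theorem gives no rate at all.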

Below are a few examples where even “large” samples don’t help you. They’re exaggerations in the sense that the distributions involved aren’t likely to occur in practice, but think of them as consciousness-raising exercises.

Means from a fat-tailed distribution

We’ve done an experiment and collected some data, and now we want to know if the mean is different from zero. Our sample doesn’t look quite normal, so we do some outlier removal and trust the CLT to take care of the rest. But suppose that the distribution underlying the sample isn’t normal, but has fat tails (exactly the kind of situation in which you would expect outliers); then the sampling distribution of the mean can still be highly non-normal even for large sample sizes.

Let’s simulate some data from a fat-tailed distribution. Below, I’ve drawn 10,000 samples at each of the sample sizes 10, 100, 1000, and 10,000 from a \mathrm{t}_{2.1}-distribution (a t-distribution with 2.1 degrees of freedom) to get a sense of what the sampling distributions look like. This distribution has finite mean and variance, so the CLT applies, but look at the Q-Q plots below:
[Figure CLT_mean: Q-Q plots of the sampling distribution of the mean at each sample size]

The sampling distribution of the mean still has very fat tails, even for a sample size of 10,000! Any confidence intervals or p-values are likely to be off.
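For readers who want to try this themselves, the following is a rough sketch of such a simulation, not the code used to produce the figure. It uses only the Python standard library, and it is deliberately scaled down (2,000 replications, sample sizes 10 and 100) so that it runs in a second or two. The helper names draw_t, sample_means and excess_kurtosis are hypothetical, and excess kurtosis is used here as a crude numeric stand-in for eyeballing a Q-Q plot: it is near 0 for normal data and large when the tails are fat.

```python
import math
import random

def draw_t(df, rng):
    """One draw from Student's t with df degrees of freedom, as Z / sqrt(V / df)."""
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(df / 2.0, 2.0)  # chi-square(df) is Gamma(shape=df/2, scale=2)
    return z / math.sqrt(v / df)

def sample_means(n, reps, df=2.1, seed=0):
    """Means of `reps` independent samples, each of size n, from t_df."""
    rng = random.Random(seed)
    return [sum(draw_t(df, rng) for _ in range(n)) / n for _ in range(reps)]

def excess_kurtosis(xs):
    """Sample excess kurtosis: fourth central moment / variance^2, minus 3."""
    m = sum(xs) / len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / len(xs)
    m4 = sum((x - m) ** 4 for x in xs) / len(xs)
    return m4 / m2 ** 2 - 3.0

if __name__ == "__main__":
    # Scaled-down assumption: the post uses 10,000 replications and n up to 10,000.
    for n in (10, 100):
        means = sample_means(n, reps=2000)
        print(f"n = {n}: excess kurtosis of the sample means = {excess_kurtosis(means):.2f}")
```

If the sample means were close to normal, the printed values would hover around zero. Raising reps and n toward the values used above is straightforward but slow in pure Python.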

Fat-tailed errors

What happens if we fit a simple linear regression model to a dataset with fat-tailed errors (again, likely to appear in practice as lots of “outliers”)? When we calculate p-values and confidence intervals, we assume that the estimates are approximately normally distributed for large samples (“large” usually meaning around 30). So again, I’ve generated 10,000 samples from a simple linear regression model with intercept 0 and slope 1 at each of the sample sizes 10, 100, 1000, and 10,000. Here are the sampling distributions for the slope parameter:
[Figure CLT_regression: sampling distributions of the slope estimate at each sample size]

Again, fat tails even for extremely large sample sizes. Any p-values or confidence intervals will be misleading.
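A comparable sketch for the regression case is below. Several details are my assumptions rather than anything stated above: the predictor is drawn from a standard normal, the errors are \mathrm{t}_{2.1}, and the replication count and sample size are much smaller than in the figure so the pure-Python code finishes quickly. The function names are hypothetical. qq_points returns the pairs you would plot in a normal Q-Q plot of the slope estimates.

```python
import math
import random
from statistics import NormalDist

def draw_t(df, rng):
    """One draw from Student's t with df degrees of freedom, as Z / sqrt(V / df)."""
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(df / 2.0, 2.0)  # chi-square(df) is Gamma(shape=df/2, scale=2)
    return z / math.sqrt(v / df)

def ols_slope(xs, ys):
    """Least-squares slope of y on x (with an intercept): Sxy / Sxx."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def simulate_slopes(n, reps, df=2.1, seed=0):
    """Slope estimates from `reps` datasets of size n with y = 0 + 1 * x + error."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # assumed design: x ~ N(0, 1)
        ys = [0.0 + 1.0 * x + draw_t(df, rng) for x in xs]  # assumed errors: t_df
        slopes.append(ols_slope(xs, ys))
    return slopes

def qq_points(values):
    """Pairs (standard normal quantile, standardized sample quantile)."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    ordered = sorted((v - m) / sd for v in values)
    normal = NormalDist()
    return [(normal.inv_cdf((i + 0.5) / n), z) for i, z in enumerate(ordered)]

if __name__ == "__main__":
    points = qq_points(simulate_slopes(n=100, reps=500))
    # Under normality the two columns would roughly agree, including at the extremes.
    for theoretical, observed in points[:3] + points[-3:]:
        print(f"{theoretical:7.3f} {observed:7.3f}")
```

Feeding the pairs from qq_points to any plotting tool gives the usual Q-Q picture, where points that bend away from the diagonal at both ends indicate fat tails.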

Sure, in practice your data probably don’t come from a distribution with tails as fat as the \mathrm{t}_{2.1}, but you probably don’t have 10,000 participants either, and even less extreme deviations from normality can mess up your sampling distributions for an n of 30. I’ve seen datasets with “outlier problems” that could just as easily be interpreted as samples from an underlying fat-tailed distribution, and that’s exactly the kind of situation where you would expect these problems to happen.