Let’s construct the z-test.

Suppose, for argument, that I offer you a wager. I open up the terminal on my computer and enter a command to generate a set of random numbers from a normal distribution. I tell you the sample size and variance arguments that I input into the command, but not the mean. For concreteness, suppose that I choose a sample size of $N = 20$ and a variance of $\sigma^2 = 2$. You thus know that I entered a command that looks something like

```
normal_rng(N = 20, mean = _, var = 2)
```

The wager is this: I tell you that I have chosen to enter either `mean = 0` or `mean = something else`, but I don’t tell you which. Instead, I give you the output of the random number generator, and challenge you to guess. How would you choose?

One naive strategy would be to calculate the mean of the output and guess `mean = something else` if the mean is different from zero, but this is clearly unreliable, as the sample mean would be expected to be slightly different from zero even if I *did* enter `mean = 0`. Another, more reasonable, strategy would be to say that you’ll choose `mean = something else` if the sample mean is very far from zero. After all, `mean = 0` would probably produce something *reasonably close* to zero. But how far is far?

Let’s set up some notation. Let $x_1, \dots, x_N$ be the output of the random number generator, let $\bar{x}$ be its mean, and let $X \sim \mathcal{N}(\mu, \sigma^2)$ denote that a particular variable $X$ has a normal distribution with mean $\mu$ and variance $\sigma^2$. What would you expect $\bar{x}$ to look like if I entered `mean = 0`? Well, you can use the following fact:

$$\bar{x} \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{N}\right)$$
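This fact is easy to check by simulation: draw many size-$N$ samples under `mean = 0` and look at the spread of their sample means (a sketch, using the $N = 20$, $\sigma^2 = 2$ values from above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, var = 20, 2.0

# 50,000 replications of the experiment under mean = 0.
means = rng.normal(0.0, np.sqrt(var), size=(50_000, N)).mean(axis=1)

# The sample means should be centered at 0 with variance var / N = 0.1.
print(means.mean(), means.var())
```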

What this means is that, if I chose to enter `mean = 0`, then the $\bar{x}$ you observe must have come from a normal distribution with mean $0$ and variance $\sigma^2/N$. If $\bar{x}$ does not appear to have come from such a distribution — if it is very far outside of the “bulk” of the bell shape — then you guess `mean = something else`. How far is very far? Easy. We know what the distribution of $\bar{x}$ would be for `mean = 0`, so we just calculate the probability that we would get a value of $\bar{x}$ at least as far out as the one we got. This probability is the p-value.

Calculating these probabilities is difficult since they have no closed-form expression, so you decide to carry around a table with the probabilities associated with every value of $\bar{x}$, but this requires a separate table for every combination of $\sigma^2$ and $N$, which is unwieldy. Instead, you use the following convenient fact:

$$\text{if } X \sim \mathcal{N}(\mu, \sigma^2), \text{ then } \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$$

In other words, we can shift and scale normal distributions, and they stay normal. If I used `mean = 0`, you know that $\bar{x} \sim \mathcal{N}(0, \sigma^2/N)$, so you can then define

$$z = \frac{\bar{x}}{\sqrt{\sigma^2/N}} \sim \mathcal{N}(0, 1)$$

And voila. You only need to carry around a table for the standard normal distribution.
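Putting the pieces together, the whole test fits in a few lines (a sketch using `scipy`, where `norm.sf` plays the role of the standard normal table; I’ve assumed a two-sided alternative to match the “at least as far out” wording):

```python
import numpy as np
from scipy.stats import norm

def z_test(x, var, mu0=0.0):
    """Two-sided z-test of mean == mu0, with known variance var.
    Returns the z statistic and its p-value."""
    N = len(x)
    z = (np.mean(x) - mu0) / np.sqrt(var / N)  # standardize: z ~ N(0, 1) under the null
    p = 2 * norm.sf(abs(z))                    # P(|Z| >= |z|), read off the standard normal "table"
    return z, p

z, p = z_test([0.1, -0.4, 0.3, 0.2, -0.1], var=2.0)
```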

This procedure (or minor variants on it) is, by far, the most widely used inferential tool in applied statistics, and an enormous body of literature has been written on its misuse and misinterpretation (for example, as I’ve discussed before, there is strong evidence that the overwhelming majority of scientists cannot correctly interpret a p-value).

Lately, I’ve been thinking about a kind of misuse that is extremely common and widely recognized, but that I’ve had trouble clearly articulating or putting onto any kind of rigorous statistical foundation. Not multiple comparisons per se, but the thing that happens when you start performing hypothesis tests on a corner of a dataset because that corner looks interesting. This is the kind of thing that Gelman and Loken (2013) would call “the garden of forking paths”, and they come closer to articulating what I’ve been thinking than anyone else, I think.

Now, a digression: the likelihood principle is a quasi-axiom of statistical inference that says, specifically, that all of the information the data contain about a statistical model or parameter is contained in the likelihood function. A not-quite-accurate but useful way to summarize this is to say that statistical inference should depend on the data, and not on extraneous factors that are, intuitively, irrelevant to inference (like the researcher’s mental state during the data collection process).

The standard illustration of this concept is the case of two experimenters, each testing whether a coin is fair or whether it is biased towards tails. Both experimenters flip the coin six times, and observe a single HEADS. Both experimenters have identical data. Both experimenters decide to conduct a significance test. The first experimenter chose in advance to flip the coin six times, suggesting a binomial distribution and giving a one-sided p-value of $7/64 \approx 0.109$. Non-significant. The second experimenter decided to continue flipping until observing HEADS, stopping on the sixth flip. This implies a geometric distribution, and gives a p-value of $(1/2)^5 \approx 0.031$. Significant. Both experimenters have exactly the same data, and yet hypothesis testing returns conflicting results because the researchers chose to collect their data differently. This is to say that significance testing violates the likelihood principle. As a side note, both models imply identical likelihood functions (proportional to $p(1-p)^5$), and so, for example, Bayesian inference would draw the same conclusion from either dataset.
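The arithmetic is easy to check (assuming one-sided tests against the “biased towards tails” alternative):

```python
from math import comb

# Both experimenters see a single HEADS in six flips
# (null: P(heads) = 1/2; alternative: biased towards tails).

# Fixed n = 6 flips in advance: binomial model.
# One-sided p-value: P(at most one head in six flips).
p_binomial = (comb(6, 0) + comb(6, 1)) / 2**6   # 7/64

# Flip until the first HEADS: geometric model.
# One-sided p-value: P(the first head takes six or more flips)
#                  = P(first five flips are all tails).
p_geometric = (1 / 2) ** 5                      # 1/32

# Same data, same likelihood (proportional to p * (1 - p)**5),
# yet the two stopping rules give different p-values.
print(p_binomial, p_geometric)
```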

To keep this tied to something tractable, let’s use a very concrete example. Suppose that I perform an experiment in which I measure two variables (A and B) across a handful of subjects. After plotting the data, I notice that one of the two variables seems to be particularly far from zero, on average. So I do a z-test and obtain a significant result. On one hand, this result should stand on its own — I obtained data, and I performed a test, and I computed a p-value with a well-defined meaning: if the true value of the mean were zero, then the probability that I would observe a mean at least this large is $p$. On the other hand, the construction of the hypothesis test implicitly (well, by construction) assumed that my test statistic (in this case, the mean) was sampled *at random* from the null distribution. But I didn’t sample it at random, I chose it by observing that the data appeared to show a large effect. It’s actually not clear how the result of the test should be interpreted in this case, since my test statistic (and thus my p-value) was not sampled according to the assumptions of the test.

Suppose that two researchers both examine the data. The first doesn’t create any plots, and simply tests variable A using a z-test against the null distribution $\mathcal{N}(0, \sigma^2/N)$. The second plots the data and observes that variable A seems the more interesting (and larger) of the two. What is the appropriate test? If the null hypothesis is true, and there are no true effects anywhere in the data, then the mean of variable A is the *larger of two random draws from the null distribution*. The density of the larger of two observations, under the null hypothesis, is then $f(z) = 2\,\Phi(z)\,\phi(z)$, where $\Phi$ and $\phi$ are the standard normal CDF and PDF, respectively. The “true” null distributions for each researcher are plotted, along with the critical values for a one-sided z-test with $\alpha = 0.05$.
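The inflation follows directly from that density: under the global null, the larger of two standard normal draws exceeds the nominal critical value with probability $1 - \Phi(z_\alpha)^2$. A quick check with `scipy`:

```python
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha)   # one-sided critical value, about 1.645

# The CDF of the larger of two standard normal draws is Phi(z)**2,
# so the "true" type I error rate at the nominal threshold is:
true_type1 = 1 - norm.cdf(z_crit) ** 2
print(true_type1)   # 1 - 0.95**2 = 0.0975
```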

So a nominal significance threshold of $\alpha = 0.05$ is actually more lenient for the second researcher than for the first — it actually corresponds to a “true” type I error rate of about $0.0975$. So even in the simple case of two possible comparisons, peeking at the data and choosing to perform a test on the most interesting variable increases the type I error rate by a factor of two. In the case of a complex, high-dimensional dataset, it’s effectively impossible to quantify the effect of researcher decisions on the properties of the test.
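The same number falls out of a direct simulation of the “peek, then test” procedure under the global null (a sketch; both variables are standardized to $\mathcal{N}(0, 1)$):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
trials, alpha = 100_000, 0.05
z_crit = norm.ppf(1 - alpha)

# Two variables per dataset, no true effects anywhere.
z = rng.standard_normal((trials, 2))

# The peeking researcher always tests the larger of the two.
false_positive_rate = (z.max(axis=1) > z_crit).mean()
print(false_positive_rate)   # close to 0.0975, not the nominal 0.05
```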

Of course, one could easily take the position that type I and II errors don’t exist or aren’t important, and that one should consider instead what Gelman calls type M and type S errors (Gelman and Tuerlinckx, 2000); that is, the probability of over- or underestimating the effect (a magnitude error), or mistaking its direction (a sign error). But that doesn’t solve the fundamental problem, which is that standard hypothesis testing makes strong assumptions about the way the test statistic is generated, and those assumptions are commonly violated in practice.

### References

Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University.

Gelman, A., & Tuerlinckx, F. (2000). Type S error rates for classical and Bayesian single and multiple comparison procedures. Computational Statistics, 15(3), 373-390.