by

Multiple hypothetical comparisons

You don’t actually have to do multiple comparisons to have a multiple comparisons problem — comparisons that you might hypothetically have done maybe if the data had been different perhaps will do it. I’ll show you.

Suppose that you’re conducting an exploratory study (don’t worry, you can make up a retrospective hypothesis when you publish it). You’ve measured a dozen or so variables — maybe your subjects performed a handful of tasks, or answered a few dozen questions, and the data are noisy, but after plotting you notice what appears to be a group difference on one of the variables, so you test it and observe a significant result. Your p-value is meaningless. This is why: The p-value is the probability that a value sampled from the null-distribution is greater than or equal to the observed value. But that’s not what you did; you didn’t sample a value from the null distribution, you sampled several values and examined the largest. What’s the probability that this value exceeds your critical value? Generally, pretty good.

If we assume that the null distribution is normal, the probabilities are easy to work out exactly. To get set up, suppose that we observe n variables, and conduct one-tailed tests with \alpha = .05.

# Parameters
x     <- seq(-5,5,length.out = 500)
alpha <- .05
crit  <- qnorm(1-alpha)
n     <- c(1,2,5,10,20,50)

We observe n variables, but after plotting, we choose the largest to test. Supposing the null hypothesis (a standard normal distribution), what is the distribution of the largest value? It’s easy to work out the CDF: letting \Phi be the standard normal CDF, if the largest value is x, then all n observations are less than x, which happens with probability

    \[      \Phi_n(x) = \Phi(x)^n \]

It follows (by the chain rule) that the density f_n is

    \[      f_n(x) = n\Phi(x)^{n-1}f(x) \]

where f is the standard normal density. This is the PDF of the largest value out of n samples from a standard normal distribution (that is, the PDF of the n‘th order statistic). It’s easy enough to implement these function in R:

# PDF/CDF for maximum order statistics
pdf.ord <- function(n, x, mu = 0, sigma = 1){
    n * pnorm(x,mu,sigma)^(n-1) * dnorm(x,mu,sigma)
}
cdf.ord <- function(n, x, mu = 0, sigma = 1){
    pnorm(x,mu,sigma)^n
}

We can plot this distribution for a few values of n

library(ggplot)
library(ggthemes)

# Data frame for plotting
p.dat <- lapply(n, 
            function(i) data.frame(Variables = i,
                                   X = x,
                                   Y = pdf.ord(n=i,x)))
plot.frame <- do.call(rbind, p.dat)

# Plot
ggplot(plot.frame, aes(x = X, y = Y, 
                       group = Variables, 
                       color = Variables)) +
    geom_line() +
    theme_hc() + 
    geom_vline(xintercept = crit,
               linetype   = 'longdash',
               color = 'red') +
    theme(axis.title.y = element_blank(),
          axis.text.y  = element_blank(),
          axis.ticks.y = element_blank())

You think your null distribution is standard normal (and compute your p-value accordingly), but your actual null distribution, for a given value of n, is shown above. The dashed line gives the critical value. See the problem?

The CDF makes it a bit clearer

# Data frame for plotting
p.dat <- lapply(n, 
           function(i) data.frame(Variables = i,
                                  X = x,
                                  Y = cdf.ord(n=i,x)))
plot.frame <- do.call(rbind, p.dat)

# Plot
ggplot(plot.frame, aes(x = X, y = Y, 
                       group = Variables, 
                       color = Variables)) +
    geom_vline(xintercept = crit, 
               linetype   = 'longdash',
               color      = 'red') +
    geom_line() +
    theme_hc() + 
    ylab('')

The nominal alpha of .05 actually induces a much more lenient threshold. In a study with only 10 measured outcomes, the plot-then-test-the-biggest approach already induces a true alpha of over .4 — almost a coin flip.

a.frame <- data.frame(Variables = n,
                      Alpha = 1 - sapply(n,
                                         pn.ord, x = crit))

ggplot(a.frame, aes(x = Variables, y = Alpha)) +
    geom_point(size = 3) +
    geom_line() +
    theme_hc() +
    ylab('True alpha\n(for nominal alpha = .05)')

The lesson is that p-values make specific assumptions about the way that the test-statistic was generated. P-values derived from tests which were conducted because of some feature of the data (e.g. because plotting suggested a group difference, or an interesting trend) are strictly uninterpretable.