Stop doing sequential testing with Bayes factors

I mentioned earlier that the whole “Bayesian inference is immune to stopping rules” excitement is largely overblown, but the simulations were based on pretty idealized situations. I thought it might be interesting to use summary statistics reported by an actual dataset to look at the performance of Bayes factors under optional stopping in semi-realistic conditions.

Field, Wagenmakers, Newell, Zeelenberg, and van Ravenzwaaij (2016) recently published a preregistered replication in which they compared two groups using a “Bayesian t-test”, and continued collecting data until the observed Bayes factor exceeded a threshold. The authors state:

For the interpretation of evidence in the Bayesian paradigm, the intention with which the data are collected is irrelevant; hence, the Bayes factor can be monitored as the data come in, and data collection can be terminated at any point

but the second statement does not follow from the first. When we use a Bayes factor to conduct sequential hypothesis tests, whether or not the interpretation of the quantity changes depending on the stopping plan (essentially, whether or not it obeys the likelihood principle) is irrelevant. What matters are the statistical properties of the test — and make no mistake: the Bayes factor, as far as it’s used in psychology, is a hypothesis test. As Sanborn and Hills (2014) have pointed out, the frequentist properties of the Bayes factor do actually depend on the stopping rule.
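That dependence is easy to demonstrate without any Bayesian machinery at all. Here is a minimal toy sketch of my own (not from Sanborn and Hills) using a running z-test under the null, where we peek at the data after every observation and stop as soon as the test is significant:

```r
# Toy example: peek at a two-sided z-test after every new observation and
# stop as soon as p < .05. With a single fixed-n look the type 1 error rate
# would be 5%; with repeated looks it climbs far higher.
set.seed(1)
n.max <- 100   # maximum sample size
n.sim <- 2000  # number of simulated experiments
rejected <- replicate(n.sim, {
    x <- rnorm(n.max)                      # data generated under H0
    z <- cumsum(x) / sqrt(seq_len(n.max))  # running z statistic
    any(abs(z[10:n.max]) > 1.96)           # start peeking at n = 10
})
mean(rejected)  # empirical type 1 error rate, well above .05
```

The same mechanism is at work when the monitored statistic is a Bayes factor: under optional stopping, the threshold gets many chances to be crossed by noise alone.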


Field et al. obtained a set of “creativity” scores from subjects who had performed either a local or a global something, whatever, I don’t really know. As part of their preregistered design, they continued collecting data until the Bayes factor exceeded a threshold either for the null hypothesis (no mean difference), or the alternative. I wondered what would happen if we applied their procedure to data with similar summary statistics, but no true mean difference. Of course, I used the same threshold and the same priors for the Bayes factor.


Up to a maximum of 150 subjects in each group, I computed the cumulative probability of having either remained agnostic, accepted the null H0, or accepted the alternative H1 (which, since the simulated data contain no true difference, is a type 1 error). These probabilities are plotted below: each curve shows the probability of having reached a particular decision by the time a particular number of subjects has been collected.
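To make the "cumulative" reading concrete, here is a tiny made-up decision matrix (one row per simulated experiment, one column per sample size); the column means give the probability of having reached each decision by that point:

```r
# Made-up decision states for three simulated experiments at three sample
# sizes; a decision, once reached, persists at all later sample sizes
dec <- rbind(c('Agnostic', 'H1',       'H1'),
             c('Agnostic', 'Agnostic', 'H0'),
             c('Agnostic', 'Agnostic', 'Agnostic'))
colMeans(dec == 'H1')  # P(accepted H1 by each sample size): 0, 1/3, 1/3
```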

Ouch. A type 1 error rate of 20 percent. This is not what we were promised.


You can download the Field et al. dataset at


# Load required packages
library(plyr)         # for ddply
library(BayesFactor)  # for ttestBF

# Load data and calculate summary statistics
data <- read.csv('JASPdata100.csv')
data$Condition <- as.factor(ifelse(data$Condition == 2, 'Local', 'Global'))
sum.statistics <- ddply(data, 'Condition', summarize,
                        M  = mean(BOC),
                        SD = sd(BOC),
                        SE = sd(BOC)/sqrt(length(BOC)))

# Simulation parameters
thresh  <- 3
num.sim <- 100  # Number of iterations
max.sub <- 150  # Subject limit (in each condition)
delay   <- 12   # When to start testing (how many in each condition)
mu      <- rep(mean(sum.statistics$M), 2)
sigma   <- rep(mean(sum.statistics$SD), 2)

# Simulate data under the null: both groups drawn from the same distribution
sim.g1 <- matrix(rnorm(num.sim*max.sub, mean = mu[1], sd = sigma[1]),
                 nrow = num.sim, ncol = max.sub)
sim.g2 <- matrix(rnorm(num.sim*max.sub, mean = mu[2], sd = sigma[2]),
                 nrow = num.sim, ncol = max.sub)

# Function to compute the Bayes factor from the first n subjects per group
bfTest <- function(n, c1, c2){
    bf <- ttestBF(x = c1[1:n], y = c2[1:n])
    extractBF(bf)$bf
}

# Function to apply the sequential test to a single simulated dataset
bfSeqTest <- function(c1, c2){
    n <- length(c1)
    sapply(2:n, function(i) bfTest(i, c1, c2))
}

# Do simulation: one row of sequential Bayes factors per simulated experiment
sim.bf <- t(sapply(1:num.sim, function(i) bfSeqTest(sim.g1[i, ], sim.g2[i, ])))

# Detect threshold passing: label the decision state at each sample size,
# keeping whichever threshold was crossed first
detectThresh <- function(x, thresh){
    n <- length(x)
    y <- rep('Agnostic', n)
    ind.H1 <- which(x > thresh)[1]
    ind.H0 <- which(x < 1/thresh)[1]
    if (is.na(ind.H1) & is.na(ind.H0)){
        # Neither threshold was crossed: remain agnostic throughout
    } else if (!is.na(ind.H1) & (is.na(ind.H0) | ind.H1 < ind.H0)){
        y[ind.H1:n] <- 'H1'
    } else if (!is.na(ind.H0) & (is.na(ind.H1) | ind.H0 < ind.H1)){
        y[ind.H0:n] <- 'H0'
    }
    y
}

# Compute cumulative decision probabilities
sim.decision <- t(apply(sim.bf, 1, detectThresh, thresh = thresh))
d.agnostic <- colMeans(sim.decision == 'Agnostic')
d.H1 <- colMeans(sim.decision == 'H1')
d.H0 <- colMeans(sim.decision == 'H0')

# Plot
plot(0, xlim = c(1,max.sub), ylim = c(0,1),
     col = 'white', xlab = 'Number of Subjects',
     ylab = 'Probability of Decision')
grid(nx = NULL, ny = NULL)
lines(2:max.sub, d.agnostic, col = 'black', lwd = 2)
lines(2:max.sub, d.H1, col = 'red', lwd = 2)
lines(2:max.sub, d.H0, col = 'blue', lwd = 2)
abline(v = delay, lty = 2, lwd = 2)
legend(x = (2/3) * max.sub, y = 1,
       legend = c('Agnostic', 'H0', 'H1'),
       fill = c('black', 'blue', 'red'))


de Heide, R., & Grünwald, P. D. (2017). Why optional stopping is a problem for Bayesians. arXiv preprint arXiv:1708.08278.

Field, S. M., Wagenmakers, E. J., Newell, B. R., Zeelenberg, R., & van Ravenzwaaij, D. (2016). Two Bayesian tests of the GLOMOsys Model. Journal of Experimental Psychology: General, 145(12), e81.

Sanborn, A. N., & Hills, T. T. (2014). The frequentist implications of optional stopping on Bayesian hypothesis tests. Psychonomic Bulletin & Review, 21(2), 283-300.