
Bayes Factors and Stopping Rules

One can argue that the choice of when to stop collecting data should be irrelevant to our parameter estimates, since all of the information should be contained in the likelihood function. This principle underlies some criticisms of standard hypothesis testing, where, for example, the decision about whether we should conclude that a coin is biased based on N flips depends on whether we chose a priori to flip the coin N times, or whether we continued flipping until we obtained a significant result.

It’s been pointed out that “the interpretation of Bayesian quantities does not depend on the stopping rule” (Rouder, 2014). This is true, but I think it misses the point. The issue isn’t whether the interpretation of a Bayesian quantity changes, but how the decision rule (based on the Bayesian quantity) performs under optional stopping. I recently came across an article by Wetzels, van Ravenzwaaij, & Wagenmakers (in press) advocating for the use of Bayes factors, saying “evidence, quantified by the Bayes factor, may be monitored as the data accumulate — data collection may stop whenever the evidence is conclusive”, which doesn’t sound quite right to me. Bayes factors are questionably Bayesian to begin with, and the stopping rule at work here (keep going until the Bayes factor crosses a threshold) doesn’t correspond to any Bayesian decision rule that I know of, so I did a few simulations to see what would happen if we applied it to some common research designs.

In the first example, I consider a simple hypothesis test for a normal mean. On each round of the simulation, I sampled from a standard normal distribution and computed both a standard two-tailed t-test (with \alpha = 0.05) and a Bayes factor comparing a null model (\mu = 0) with an alternative model (\mu \neq 0); the BF was computed with the BayesFactor package for R, using the default priors. I began with 2 observations, and then increased the sample size until the null hypothesis was rejected (for the p-value) or until the BF accepted one of the two models (defined as a Bayes factor > 3 for the alternative model, or < 1/3 for the null model). Below, I plot the proportion of each decision at each sample size up to 1000.
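The first design is easy to sketch in code. The snippet below is a minimal Python version of the Bayes-factor stopping rule, with one substitution: instead of the BayesFactor package's default JZS prior, it uses the cruder BIC (unit-information) approximation BF_{10} \approx (1 + t^2/(n-1))^{n/2} / \sqrt{n}, so the exact stopping proportions will not match the figure.

```python
import math
import random

def bf10_bic(t, n):
    # Unit-information (BIC) approximation to the one-sample Bayes factor --
    # a stand-in for the JZS default prior used in the original simulation.
    return (1.0 + t * t / (n - 1)) ** (n / 2.0) / math.sqrt(n)

def run_once(max_n=1000, rng=random):
    s = ssq = 0.0
    n = 0
    while n < max_n:
        x = rng.gauss(0.0, 1.0)  # the null is true: mu = 0
        n += 1
        s += x
        ssq += x * x
        if n < 2:
            continue
        var = (ssq - s * s / n) / (n - 1)
        if var <= 0:
            continue
        t = (s / n) / math.sqrt(var / n)
        bf = bf10_bic(t, n)
        if bf > 3:          # "conclusive" evidence for the alternative
            return "alt"
        if bf < 1 / 3:      # "conclusive" evidence for the null
            return "null"
    return "undecided"

random.seed(1)
results = [run_once() for _ in range(500)]
alt_rate = results.count("alt") / len(results)
print(f"stopped for H1 (false positive): {alt_rate:.1%}")
print(f"stopped for H0: {results.count('null') / len(results):.1%}")
```

Note that the false-positive rate reported here is a property of the whole sequential procedure, not of any single Bayes factor computed along the way.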


What does this tell us? If we keep collecting data until we achieve a significant result, then the probability that the p-value returns a false positive increases with the sample size, and will approach 1 as the sample size approaches infinity. This is not surprising, since a standard hypothesis test cannot accept the null hypothesis; we simply continue collecting data until the null is rejected. The same is not true of the Bayes factor, since we now have the capacity to accept the null model, but we still (falsely) accept the alternative model 20\% of the time.

What about something more complicated? I repeated the simulation with some regression data, consisting of a single outcome y and five predictors (x_1, x_2, x_3, x_4, x_5). The variables were drawn from independent standard normal distributions, so that there was, in truth, no relationship between any of the variables. In the first simulation, we stop either when the regression itself is significant (\alpha = 0.05), or when the Bayes factor comparing the full model to the null (intercept-only) model exits the interval [1/3, \ 3].
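A sketch of this full-model-versus-null stopping rule, again substituting the BIC approximation for the package's default prior: here \Delta BIC = n \ln(RSS_{full}/RSS_{null}) + k \ln n for the k = 5 extra slopes, and the thresholds are compared on the log scale to avoid overflow. The sufficient statistics are updated incrementally so each step only needs a small 6 \times 6 solve.

```python
import math
import random

def solve(A, b):
    # Gaussian elimination with partial pivoting (small dense systems only)
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def run_once(max_n=300, k=5, rng=random):
    p = k + 1  # intercept plus k slopes
    XtX = [[0.0] * p for _ in range(p)]
    Xty = [0.0] * p
    yty = ysum = 0.0
    for n in range(1, max_n + 1):
        row = [1.0] + [rng.gauss(0.0, 1.0) for _ in range(k)]
        yv = rng.gauss(0.0, 1.0)  # no true relationship to any predictor
        for a in range(p):
            Xty[a] += row[a] * yv
            for b in range(p):
                XtX[a][b] += row[a] * row[b]
        yty += yv * yv
        ysum += yv
        if n < 10:  # wait for a stable fit of 6 coefficients
            continue
        beta = solve(XtX, Xty)
        rss_full = yty - sum(beta[a] * Xty[a] for a in range(p))
        rss_null = yty - ysum * ysum / n  # intercept-only residual SS
        if rss_full <= 0 or rss_null <= 0:
            continue
        # BF10 ~= exp(-dbic / 2); BF10 > 3 iff dbic < -2 ln 3, etc.
        dbic = n * math.log(rss_full / rss_null) + k * math.log(n)
        if dbic < -2 * math.log(3):
            return "alt"
        if dbic > 2 * math.log(3):
            return "null"
    return "undecided"

random.seed(2)
results = [run_once() for _ in range(200)]
print("accepted the full model:", results.count("alt") / len(results))
print("accepted the null model:", results.count("null") / len(results))
```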


The results are much the same as in the previous example, though slightly better for the Bayes factor (only a 10\% type 1 error rate). What if we collect data until we find any effect, another common research tactic? In this case, we continue sampling until at least one variable is significant, or until one of the Bayes factors comparing y = x_0 + x_i to y = x_0 exceeds the threshold.
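The any-effect rule can be sketched the same way. For a simple regression y = x_0 + x_i, the BIC-approximate Bayes factor reduces to a function of the squared correlation between y and x_i. In the version below (my reading of the rule, again with the BIC approximation rather than the package's default prior), sampling stops for the alternative as soon as any of the five BFs exceeds 3, and for the null only when all five drop below 1/3.

```python
import math
import random

def run_once(max_n=1000, k=5, rng=random):
    # Running sufficient statistics for each of the k simple regressions
    sx = [0.0] * k
    sxx = [0.0] * k
    sxy = [0.0] * k
    sy = syy = 0.0
    for n in range(1, max_n + 1):
        xs = [rng.gauss(0.0, 1.0) for _ in range(k)]
        yv = rng.gauss(0.0, 1.0)  # y is unrelated to every predictor
        sy += yv
        syy += yv * yv
        for i in range(k):
            sx[i] += xs[i]
            sxx[i] += xs[i] * xs[i]
            sxy[i] += xs[i] * yv
        if n < 10:
            continue
        bfs = []
        for i in range(k):
            num = n * sxy[i] - sx[i] * sy
            den = (n * sxx[i] - sx[i] ** 2) * (n * syy - sy ** 2)
            r2 = num * num / den if den > 0 else 0.0
            # BIC approximation for y = b0 + b1*x_i versus y = b0
            bfs.append((1.0 - r2) ** (-n / 2.0) / math.sqrt(n))
        if max(bfs) > 3:      # stop as soon as ANY predictor looks conclusive
            return "alt"
        if max(bfs) < 1 / 3:  # all five Bayes factors favour the null
            return "null"
    return "undecided"

random.seed(3)
results = [run_once() for _ in range(200)]
print("stopped on some x_i (false positive):", results.count("alt") / len(results))
print("all predictors favoured the null:", results.count("null") / len(results))
```

Searching across five Bayes factors each step is what drives the error rate up: the stopping rule fires on the maximum of five noisy quantities, not on any one of them.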


Even worse. The Bayes factor is increasingly likely to falsely accept the alternative as the sample size increases, plateauing at a type 1 error rate of 40\%.

The lesson here is that the problem with optional stopping is the decision rule, not Bayesian vs. frequentist statistics. The frequentist properties of the Bayes factor (e.g. type 1 error rate) may be better than those of the p-value, but they are still very bad. This nicely illustrates one of the many problems I have with the introduction of Bayesian statistics to psychology, which is that it has not changed the way that psychologists do research or analyze their data at all. In fact, it has convinced them that they need to worry even less about those things than before. Don’t worry about testing real, sensible hypotheses, the Bayes factor will let you test straw-man point-nulls without the guilt. Don’t worry about small samples, noisy measurements, or the garden of forking paths, and don’t worry about optional stopping, Bayesian statistics is immune to those kinds of problems.


Wetzels, R., van Ravenzwaaij, D., & Wagenmakers, E.-J. (in press). Bayesian analysis. In R. Cautin, & S. Lilienfeld (Eds.), The Encyclopedia of Clinical Psychology. Wiley-Blackwell.

Rouder, J. N. (2014). Optional stopping: No problem for Bayesians. Psychonomic Bulletin & Review, 21(2), 301-308.
